
***What is Pandas?***

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python language.

The name is derived from "*Panel Data*": datasets that include multiple observations over multiple periods of time.

Pandas is built on top of NumPy and relies on its computational abilities and array structure.

pandas.pydata.org

**Data Structures - Series and DataFrame**

Series - Single-column data.

DataFrame - Multi-column data; a collection of Series. Information is organised into rows and columns.

Anaconda Prompt (or the Terminal) is the command-line tool through which you tell your operating system to install modules and packages.

In [None]:
!pip install pandas --upgrade
## Downloads the latest version

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd

In [None]:
pd.__version__

'1.3.5'

Creating a series from a list

In [None]:
products = ['A','B','C','D']
products

['A', 'B', 'C', 'D']

In [None]:
type(products)

list

In [None]:
product_categories = pd.Series(products)
product_categories

0    A
1    B
2    C
3    D
dtype: object

In [None]:
type(product_categories)

pandas.core.series.Series

In [None]:
daily_rates_dollars = pd.Series([40, 45, 50 ,55])
daily_rates_dollars 

0    40
1    45
2    50
3    55
dtype: int64

Pandas series from Numpy

In [None]:
import numpy as np

In [None]:
array_a = np.array([10,20,30,40,50])
array_a

array([10, 20, 30, 40, 50])

In [None]:
type(array_a)

numpy.ndarray

In [None]:
series_a = pd.Series(array_a)
series_a

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [None]:
type(series_a)

pandas.core.series.Series

A pandas Series is a more powerful version of a list, or an enhanced version of a NumPy array.

Almost every entity in Python is an object.

A Python object contains data, metadata and related functionality.

A variable providing metadata about an object - *ATTRIBUTE*

A function associated with an object - *METHOD*

Attributes are passive (they only provide information about the object). Methods are active (in the sense that they actually work on the data in the object).
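
The contrast between attributes and methods can be sketched as follows. The Series `temperatures` and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical example data, used only to illustrate the distinction.
temperatures = pd.Series([21, 19, 24, 22])

# Attributes: accessed without parentheses; they only describe the object.
n_elements = temperatures.size     # number of elements
value_type = temperatures.dtype    # data type of the stored values

# Methods: called with parentheses; they compute something from the data.
total = temperatures.sum()
average = temperatures.mean()

print(n_elements, value_type)
print(total, average)
```

A quick way to tell them apart when reading code: an attribute is never followed by parentheses, a method call always is.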

In [None]:
series_a.dtype

dtype('int64')

In [None]:
series_a.size
## how many elements in a series

5

In [None]:
product_categories.name = 'Product Categories'

In [None]:
product_categories

0    A
1    B
2    C
3    D
Name: Product Categories, dtype: object

Indexing

In [None]:
prices_per_category = {'Product A':22250,'Product B':16600,'Product C':15600}
prices_per_category

{'Product A': 22250, 'Product B': 16600, 'Product C': 15600}

In [None]:
type(prices_per_category)

dict

In [None]:
prices_per_category = pd.Series(prices_per_category)

In [None]:
prices_per_category

Product A    22250
Product B    16600
Product C    15600
dtype: int64

In [None]:
prices_per_category.index

Index(['Product A', 'Product B', 'Product C'], dtype='object')

Label-based or Position-based indexing

Position-based (zero-based) indexing - When the index of a Series is a RangeIndex object (the default 0, 1, 2, ...), the labels coincide with the positions of the values.

Label-based indexing - The index holds names that logically correspond to the data values contained in the Series.

In [None]:
series_a[0]

10

In [None]:
prices_per_category['Product A']

22250
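
As a small sketch of the two styles side by side, the same value can be reached by position or by label. The explicit `.iloc` and `.loc` accessors are used here (they are introduced for DataFrames later in this notebook) because they state which kind of indexing is meant; the variable names are illustrative.

```python
import pandas as pd

prices = pd.Series({'Product A': 22250, 'Product B': 16600, 'Product C': 15600})

# Position-based: the first element, whatever its label is.
first_by_position = prices.iloc[0]

# Label-based: the element whose index label is 'Product A'.
first_by_label = prices.loc['Product A']

print(first_by_position, first_by_label)
```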

Difference between Methods and Functions

Similarity - When provided with some initial data, both can perform specific operations on it and return an output.

Difference - A function is an independent entity, whereas a method belongs to a class and is applied to an object of that class.
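
A minimal sketch of the difference, assuming a small made-up Series called `deposits`: the built-in function `len()` takes the object as its argument, while `.sum()` is called on the object itself.

```python
import pandas as pd

# Hypothetical example data.
deposits = pd.Series([2000, 3000, 4000])

# Function: stands on its own and receives the object as an argument.
count = len(deposits)

# Method: defined by the Series class and called on the object.
total = deposits.sum()

print(count, total)
```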

**Methods for Pandas Series**

In [None]:
start_date_deposits = pd.Series({'1/1/2022':2000,'1/2/2022':3000,'1/3/2022':4000,'1/4/2022':5000,'1/5/2022':6000})
start_date_deposits

1/1/2022    2000
1/2/2022    3000
1/3/2022    4000
1/4/2022    5000
1/5/2022    6000
dtype: int64

In [None]:
start_date_deposits.sum()

20000

In [None]:
start_date_deposits.min()

2000

In [None]:
start_date_deposits.max()

6000

In [None]:
start_date_deposits.idxmax()

'1/5/2022'

In [None]:
start_date_deposits.idxmin()

'1/1/2022'

In [None]:
start_date_deposits.head(n=1)

1/1/2022    2000
dtype: int64

In [None]:
start_date_deposits.tail(1)

1/5/2022    6000
dtype: int64

Parameter - A named input associated with a given method. Parameters let you modify the way in which the method operates.

Argument - The value passed for a parameter.
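
To make the two terms concrete, here is a short sketch using `head()` and `sort_values()`. The Series holds the deposit amounts from above with a default integer index; the variable names are illustrative.

```python
import pandas as pd

deposits = pd.Series([2000, 3000, 4000, 5000, 6000])

# `n` is a parameter of head(); 2 is the argument passed for it.
first_two = deposits.head(n=2)

# Omitting the argument falls back to the parameter's default value (n=5).
default_head = deposits.head()

# `ascending` is a parameter of sort_values(); False is the argument.
largest_first = deposits.sort_values(ascending=False)

print(first_two)
print(largest_first)
```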

***Pandas DataFrame***

A DataFrame is a tabular structure that contains multiple observations for a given set of variables.

It is a collection of Series: two-dimensional data organised into rows and columns.

Creating pandas DataFrame from Scratch

In [None]:
# 1. DataFrame from a dictionary of Lists

data = {'ProductName':['Product A','Product B','Product C'],'ProductPrice':[22500,16600,12500]}
df = pd.DataFrame(data)
df

   ProductName  ProductPrice
0    Product A         22500
1    Product B         16600
2    Product C         12500


In [None]:
# 2. DataFrame from a dictionary of Lists + specify an index

data = {'ProductName':['Product A','Product B','Product C'],'ProductPrice':[22500,16600,12500]}
df = pd.DataFrame(data,index=['A','B','C'])
df

   ProductName  ProductPrice
A    Product A         22500
B    Product B         16600
C    Product C         12500


In [None]:
#3 DataFrame from a list of Dictionaries

data = [{'ProductName':'Product A','ProductPrice':22500},
        {'ProductName':'Product B','ProductPrice':16600},
        {'ProductName':'Product C','ProductPrice':12500}]
df = pd.DataFrame(data)
df

   ProductName  ProductPrice
0    Product A         22500
1    Product B         16600
2    Product C         12500


In [None]:
#4 From dictionary of Pandas series

ser_products = pd.Series(['Product A','Product B','Product C'])
ser_prices = pd.Series([22500,16600,12500])

In [None]:
data = {'ProductName':ser_products,'ProductPrice':ser_prices}
df = pd.DataFrame(data)
df

   ProductName  ProductPrice
0    Product A         22500
1    Product B         16600
2    Product C         12500


In [None]:
#5 From a list of lists

data = [['Product A',22500],['Product B',16600],['Product C',12500]]
df = pd.DataFrame(data)
df

           0      1
0  Product A  22500
1  Product B  16600
2  Product C  12500


In [None]:
df.columns = ['ProductName','ProductPrice']
df.index = ['A','B','C']
df

   ProductName  ProductPrice
A    Product A         22500
B    Product B         16600
C    Product C         12500


In [None]:
#6 Professional way

df = pd.DataFrame(data=[['Product A',22500],['Product B',16600],['Product C',12500]],
                  columns= ['ProductName','ProductPrice'],
                  index= ['A','B','C'])
df

   ProductName  ProductPrice
A    Product A         22500
B    Product B         16600
C    Product C         12500
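
The earlier description of a DataFrame as a collection of Series can be checked directly: selecting a single column hands back a Series. This is a small sketch reusing the product data from above; `price_column` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'ProductName': ['Product A', 'Product B', 'Product C'],
                   'ProductPrice': [22500, 16600, 12500]})

# Selecting one column with the indexing operator returns a Series.
price_column = df['ProductPrice']

print(type(price_column))
print(price_column.name)   # the Series keeps the column name
print(df.shape)            # (rows, columns)
```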


**Data Cleaning**: Applying analytical and programming techniques to convert a messy, hard-to-interpret dataset into a meaningful, high-quality format that we can use for further processing.

Pandas methods

In [None]:
# unique() - unique values

In [None]:
# nunique() - number of unique values
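
The two cells above only name the methods, so here is a brief sketch of both. The `ratings` Series is made-up example data, not taken from a dataset.

```python
import pandas as pd

# Hypothetical example data with repeated values.
ratings = pd.Series(['PG-13', 'TV-MA', 'TV-MA', 'PG', 'PG-13'])

# unique(): the distinct values, in order of first appearance.
distinct_ratings = ratings.unique()

# nunique(): how many distinct values there are (missing values are
# not counted by default).
n_distinct = ratings.nunique()

print(distinct_ratings)
print(n_distinct)
```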

In [None]:
prices_per_category = pd.Series({'Product A':22250,'Product B':16600,'Product C':15600})
prices_per_category

Product A    22250
Product B    16600
Product C    15600
dtype: int64

In [None]:
prices_per_category.values

array([22250, 16600, 15600])

In [None]:
type(prices_per_category.values)

numpy.ndarray

In [None]:
prices_per_category.array

<PandasArray>
[22250, 16600, 15600]
Length: 3, dtype: int64

In [None]:
type(prices_per_category.array)

pandas.core.arrays.numpy_.PandasArray

In [None]:
prices_per_category.to_numpy()

array([22250, 16600, 15600])

In [None]:
type(prices_per_category.to_numpy())

numpy.ndarray

All three of the above give access to the same data, so use whichever fits the requirement: .to_numpy() when a NumPy array is needed, .array for the underlying pandas array. (The pandas documentation recommends these two over .values.)
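
As a rough sketch of when each one is useful: `.to_numpy()` always returns a NumPy array and can convert the data type on the way out, while `.array` returns the pandas array that backs the Series. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

prices = pd.Series({'Product A': 22250, 'Product B': 16600, 'Product C': 15600})

# A plain NumPy array of the values.
as_numpy = prices.to_numpy()

# to_numpy() accepts a dtype argument to convert the values while exporting.
as_float = prices.to_numpy(dtype='float64')

# The pandas array that holds the Series data.
as_pandas_array = prices.array

print(type(as_numpy), as_float.dtype, len(as_pandas_array))
```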

 .sort_values() method

In [None]:
numbers = pd.Series([100,50,200,150])
numbers

0    100
1     50
2    200
3    150
dtype: int64

In [None]:
numbers.sort_values()

1     50
0    100
3    150
2    200
dtype: int64

In [None]:
numbers.sort_values(ascending=True)

1     50
0    100
3    150
2    200
dtype: int64

In [None]:
numbers.sort_values(ascending=False)

2    200
3    150
0    100
1     50
dtype: int64

Chaining : Attribute and Method Chaining

Method Chaining: organizes and applies several method calls on a given object in a fixed order. Each call performs its action and returns an intermediate output, on which the next call operates.

In [None]:
## Attribute chaining
numbers.index.name='Index'

In [None]:
numbers

Index
0    100
1     50
2    200
3    150
dtype: int64

In [None]:
## Method Chaining
numbers.sort_values().head(1)

Index
1    50
dtype: int64
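
A slightly longer chain, sketched with the same `numbers` data, shows how each call works on the result of the previous one. The second half spells out the same steps with named intermediate results; all variable names are illustrative.

```python
import pandas as pd

numbers = pd.Series([100, 50, 200, 150])

# Chained: sort descending, keep the two largest values, add them up.
top_two_total = numbers.sort_values(ascending=False).head(2).sum()

# The same steps one at a time, naming each intermediate output.
sorted_desc = numbers.sort_values(ascending=False)
top_two = sorted_desc.head(2)
step_by_step_total = top_two.sum()

print(top_two_total, step_by_step_total)
```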

In [None]:
numbers.sort_index(ascending=False)

Index
3    150
2    200
1     50
0    100
dtype: int64

In [None]:
numbers.sort_index(ascending=True)

Index
0    100
1     50
2    200
3    150
dtype: int64

Pandas DataFrame attributes

In [None]:
## df.index
## df.columns
## df.axes
## df.dtypes
## df.values
## df.shape
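
The attributes listed in the cell above can be tried on the small product DataFrame built earlier. This is a minimal sketch; the comments describe what each attribute holds.

```python
import pandas as pd

df = pd.DataFrame(data=[['Product A', 22500], ['Product B', 16600], ['Product C', 12500]],
                  columns=['ProductName', 'ProductPrice'],
                  index=['A', 'B', 'C'])

print(df.index)     # row labels
print(df.columns)   # column labels
print(df.axes)      # list holding the row index and the column index
print(df.dtypes)    # data type of each column
print(df.values)    # the data as a NumPy array
print(df.shape)     # (number of rows, number of columns)
```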

Data Selection: Extracting rows, columns, or subsets from such an object.

1. Indexing : 

In [None]:
netflix = pd.read_csv("https://raw.githubusercontent.com/practiceprobs/datasets/main/netflix-titles/netflix-titles.csv",index_col='show_id')
netflix.head(1)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."


In [None]:
## Attribute type access
netflix.type

show_id
s1         Movie
s2       TV Show
s3       TV Show
s4       TV Show
s5       TV Show
          ...   
s8803      Movie
s8804    TV Show
s8805      Movie
s8806      Movie
s8807      Movie
Name: type, Length: 8807, dtype: object

In [None]:
## Indexing operator access
netflix['type']

show_id
s1         Movie
s2       TV Show
s3       TV Show
s4       TV Show
s5       TV Show
          ...   
s8803      Movie
s8804    TV Show
s8805      Movie
s8806      Movie
s8807      Movie
Name: type, Length: 8807, dtype: object

In [None]:
netflix[['type']]

show_id,type
s1,Movie
s2,TV Show
s3,TV Show
s4,TV Show
s5,TV Show
...,...
s8803,Movie
s8804,TV Show
s8805,Movie
s8806,Movie
s8807,Movie


**df.iloc** - Indexer (accessor) attribute

i - integer, loc - location

Integer-location-based indexing, for selection by position.

In [None]:
netflix.iloc[1]

type                                                      TV Show
title                                               Blood & Water
director                                                      NaN
cast            Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...
country                                              South Africa
date_added                                     September 24, 2021
release_year                                                 2021
rating                                                      TV-MA
duration                                                2 Seasons
listed_in         International TV Shows, TV Dramas, TV Mysteries
description     After crossing paths at a party, a Cape Town t...
Name: s2, dtype: object

In [None]:
netflix.iloc[1,1]

'Blood & Water'

In [None]:
netflix.iloc[[1,3],:]

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
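
The common forms of position-based selection can be sketched on a small stand-in table, so the example runs without downloading the dataset. The DataFrame `shows` below is a hand-built, four-row excerpt (only the `type` and `title` columns) and is an assumption made for illustration.

```python
import pandas as pd

# Small stand-in for the Netflix data, indexed by show_id.
shows = pd.DataFrame(
    {'type': ['Movie', 'TV Show', 'TV Show', 'TV Show'],
     'title': ['Dick Johnson Is Dead', 'Blood & Water',
               'Ganglands', 'Jailbirds New Orleans']},
    index=['s1', 's2', 's3', 's4'])

second_row = shows.iloc[1]             # one row, returned as a Series
one_cell = shows.iloc[1, 0]            # row position 1, column position 0
first_two_rows = shows.iloc[0:2]       # slice: positions 0 and 1 (stop excluded)
rows_1_and_3 = shows.iloc[[1, 3], :]   # list of row positions, all columns

print(one_cell)
print(first_two_rows)
```

Positions count only the data columns: a column that has been moved into the index is no longer reachable through a column position.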


**df.loc** - loc indexer / loc accessor

Label-based indexing, for selection by row and column labels.

In [None]:
netflix.loc['s10',:]

type                                                        Movie
title                                                The Starling
director                                           Theodore Melfi
cast            Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...
country                                             United States
date_added                                     September 24, 2021
release_year                                                 2021
rating                                                      PG-13
duration                                                  104 min
listed_in                                        Comedies, Dramas
description     A woman adjusting to life after a loss contend...
Name: s10, dtype: object

In [None]:
netflix['title']['s10']

'The Starling'
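
A single cell can also be reached in one step by giving `.loc` both the row label and the column label, instead of chaining two indexing operations. The sketch below uses a small hand-built stand-in table (`shows`, an assumption for illustration) and also shows that label slices include both endpoints.

```python
import pandas as pd

# Small stand-in for the Netflix data, indexed by show_id.
shows = pd.DataFrame(
    {'type': ['Movie', 'TV Show', 'TV Show', 'TV Show'],
     'title': ['Dick Johnson Is Dead', 'Blood & Water',
               'Ganglands', 'Jailbirds New Orleans']},
    index=['s1', 's2', 's3', 's4'])

# Row label and column label in a single .loc call.
title_s2 = shows.loc['s2', 'title']

# Label slices include the stop label, unlike position slices.
label_slice = shows.loc['s2':'s3']

# A list of labels selects exactly those rows.
picked = shows.loc[['s1', 's4'], 'title']

print(title_s2)
print(label_slice)
```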

In [None]:
netflix['title'].unique()

array(['Dick Johnson Is Dead', 'Blood & Water', 'Ganglands', ...,
       'Zombieland', 'Zoom', 'Zubaan'], dtype=object)

In [None]:
array_a = np.array([[3,2,1],[6,3,2]])

In [None]:
pd.DataFrame(array_a)

   0  1  2
0  3  2  1
1  6  3  2


In [None]:
# Column names go in a list: a set has no guaranteed order, so the names could end up shuffled
df = pd.DataFrame(array_a,columns=['A','B','C'])
df 

   A  B  C
0  3  2  1
1  6  3  2
