# Pandas

Pandas is a core Python library for data science and analytics. It provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy. Pandas can read data from a wide variety of sources, such as Excel sheets, CSV files, SQL databases, or even web pages.

Documentation: https://pandas.pydata.org/pandas-docs/version/0.25.3/

## Importing Pandas

In [1]:
import pandas as pd

## Pandas Series

A pandas Series is a one-dimensional labeled array capable of holding data of any type.

In [62]:
# creating a pandas Series from a list

lst = ['a', 'b', 'c', 2, 'e']
ser_1 = pd.Series(lst)

print(ser_1)

0    a
1    b
2    c
3    2
4    e
dtype: object


In [4]:
type(ser_1)

pandas.core.series.Series

In [63]:
# creating a pandas Series from a NumPy array

import numpy as np
arr = np.array([10, 20, 30, 40, 50])
ser_2 = pd.Series(arr)

print(ser_2)

0    10
1    20
2    30
3    40
4    50
dtype: int32
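The int32 reported above depends on the platform and NumPy version; on many systems the same code reports int64. A minimal sketch, assuming you want the same integer width everywhere, passes dtype explicitly (ser_fixed is an illustrative name, not part of the original notebook):

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40, 50])

# Request a specific integer width instead of relying on the platform default
ser_fixed = pd.Series(arr, dtype="int64")

print(ser_fixed.dtype)
```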


In [64]:
# create a Series with custom index labels

ser_3 = pd.Series(arr, index=['sarah', 'bob', 'alex', 'den', 'nancy'])

print(ser_3)

sarah    10
bob      20
alex     30
den      40
nancy    50
dtype: int32


In [8]:
print(ser_3.values)
print(ser_3.index)

[10 20 30 40 50]
Index(['sarah', 'bob', 'alex', 'den', 'nancy'], dtype='object')


In [10]:
# Attributes

print(' Size:',ser_3.size,'\n',
'Shape:',ser_3.shape,'\n',
'Dimension:',ser_3.ndim,'\n',
'Data type:',ser_3.dtype)

 Size: 5 
 Shape: (5,) 
 Dimension: 1 
 Data type: int32


In [16]:
# slicing by implicit integer index

print(ser_3[0:3])

sarah    10
bob      20
alex     30
dtype: int32


In [17]:
# slicing by explicit index

ser_3['sarah':'alex']

sarah    10
bob      20
alex     30
dtype: int32

Notice that when slicing with an explicit index (i.e., ser_3['sarah':'alex']), the final index is included in the slice, while when slicing with an implicit index (i.e., ser_3[0:3]), the final index is excluded from the slice.
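To make the choice between the two behaviours explicit, pandas offers the .iloc (position-based) and .loc (label-based) accessors. The sketch below rebuilds ser_3 so it runs on its own; by_position and by_label are illustrative names.

```python
import numpy as np
import pandas as pd

ser_3 = pd.Series(np.array([10, 20, 30, 40, 50]),
                  index=['sarah', 'bob', 'alex', 'den', 'nancy'])

# .iloc always slices by position; the stop position is excluded
by_position = ser_3.iloc[0:3]

# .loc always slices by label; the stop label is included
by_label = ser_3.loc['sarah':'alex']

# Both selections cover sarah, bob and alex
print(by_position.equals(by_label))
```

Using the accessors avoids having to remember which rule plain square brackets apply for a given index.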

## Creating a Series from a Dictionary

In [66]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [19]:
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
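Because the Series was built from a dictionary, it also supports dictionary-style lookups alongside slicing. A minimal sketch, reusing the population data from above (the 20,000,000 threshold and the name large_states are arbitrary, chosen only for illustration):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Look up a single value by its label, as with a dict key
print(population['Texas'])

# Membership tests check the index labels, not the values
print('Florida' in population)

# A boolean mask keeps only the entries that satisfy the condition
large_states = population[population > 20000000]
print(large_states)
```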

## Pandas DataFrame

A pandas DataFrame is a two-dimensional labeled data structure whose columns can each hold a different data type.

In [67]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [68]:
states = pd.DataFrame({'population': population})
states

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [74]:
pet_info = [['Jack', 2, 'Parrot'], ['Lucy', 3, 'Cat'], ['Cinco', 8, 'Rabbit']]
pet_df = pd.DataFrame(pet_info, columns=['Name', 'Age', 'Type'])

pet_df

    Name  Age    Type
0   Jack    2  Parrot
1   Lucy    3     Cat
2  Cinco    8  Rabbit


In [22]:
# DataFrame from a Python 2D list

pet_info = [['Blain', 10, 'Dog'], ['Lucy', 4 , 'Cat'], ['Cinco', 8, 'Rabbit']]
pet_df = pd.DataFrame(pet_info, columns = ['Name', 'Age', 'Type'])

pet_df

    Name  Age    Type
0  Blain   10     Dog
1   Lucy    4     Cat
2  Cinco    8  Rabbit


In [75]:
cities_info = {"Cities": ["Karachi", "Islamabad", "Lahore"],
               "Population": [12124124314, 232143412, 1321431241],
               "Area": [2131241241, 4532523, 214412312]}
cities_df = pd.DataFrame(cities_info)

cities_df

      Cities   Population        Area
0    Karachi  12124124314  2131241241
1  Islamabad    232143412     4532523
2     Lahore   1321431241   214412312


In [76]:
cities_df["Cities"]

0      Karachi
1    Islamabad
2       Lahore
Name: Cities, dtype: object
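Selecting one column with cities_df["Cities"] returns a Series. A minimal sketch of a few related selections follows; it rebuilds cities_df so it runs on its own, and subset, first_row, and lahore are illustrative names.

```python
import pandas as pd

cities_info = {"Cities": ["Karachi", "Islamabad", "Lahore"],
               "Population": [12124124314, 232143412, 1321431241],
               "Area": [2131241241, 4532523, 214412312]}
cities_df = pd.DataFrame(cities_info)

# A list of column names returns a DataFrame with just those columns
subset = cities_df[["Cities", "Area"]]

# .iloc selects a row by position and returns it as a Series
first_row = cities_df.iloc[0]

# A boolean condition inside .loc keeps only the matching rows
lahore = cities_df.loc[cities_df["Cities"] == "Lahore"]

print(subset)
print(first_row)
print(lahore)
```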

## DataFrame Functions

There are some functions and attributes that allow us to observe basic information about the data stored in a DataFrame object:

DataFrame.head() -> returns the content of the first 5 rows, by default

DataFrame.tail() -> returns the content of the last 5 rows, by default

DataFrame.shape -> returns a tuple of the form (num_rows, num_columns)

DataFrame.columns -> returns the name of the columns

DataFrame.index -> returns the index of the rows

In [77]:
population_dict = {'States': ['California', 'Texas', 'New York', 'Florida', 'Illinois', 'Chicago', 'Utah'],
                   'Population': [38332521, 26448193, 19651127, 19552860, 12882135, 235821531, 2148149]}
population_df = pd.DataFrame(population_dict)

In [78]:
population_df

       States  Population
0  California    38332521
1       Texas    26448193
2    New York    19651127
3     Florida    19552860
4    Illinois    12882135
5     Chicago   235821531
6        Utah     2148149


In [81]:
population_df.head()

       States  Population
0  California    38332521
1       Texas    26448193
2    New York    19651127
3     Florida    19552860
4    Illinois    12882135


In [37]:
population_df.tail()

     States  Population
2  New York    19651127
3   Florida    19552860
4  Illinois    12882135
5   Chicago   235821531
6      Utah     2148149


In [38]:
population_df.shape

(7, 2)

In [39]:
population_df.index

RangeIndex(start=0, stop=7, step=1)

In [40]:
population_df.columns

Index(['States', 'Population'], dtype='object')
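head() and tail() also accept the number of rows to return, and shape can be unpacked into two variables. A minimal sketch, reusing the same population data (first_three, last_two, n_rows, and n_cols are illustrative names):

```python
import pandas as pd

population_df = pd.DataFrame({
    'States': ['California', 'Texas', 'New York', 'Florida',
               'Illinois', 'Chicago', 'Utah'],
    'Population': [38332521, 26448193, 19651127, 19552860,
                   12882135, 235821531, 2148149],
})

# Pass a row count to override the default of 5
first_three = population_df.head(3)
last_two = population_df.tail(2)

# shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = population_df.shape

print(first_three)
print(last_two)
print(n_rows, n_cols)
```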

## Adding a new column

In [42]:
cities_df

      Cities   Population        Area
0    Karachi  12124124314  2131241241
1  Islamabad    232143412     4532523
2     Lahore   1321431241   214412312


In [85]:
# assigning to a new column name adds that column; the division is applied row by row
cities_df['density'] = cities_df['Population'] / cities_df['Area']

In [86]:
cities_df

      Cities   Population        Area    density
0    Karachi  12124124314  2131241241   5.688762
1  Islamabad    232143412     4532523  51.217261
2     Lahore   1321431241   214412312   6.163038
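Column assignment modifies cities_df in place. If the original frame should stay untouched, DataFrame.assign returns a new DataFrame with the extra column. A minimal sketch under that assumption (cities_with_density is an illustrative name):

```python
import pandas as pd

cities_df = pd.DataFrame({
    "Cities": ["Karachi", "Islamabad", "Lahore"],
    "Population": [12124124314, 232143412, 1321431241],
    "Area": [2131241241, 4532523, 214412312],
})

# assign() leaves cities_df unchanged and returns a copy with the new column
cities_with_density = cities_df.assign(
    density=lambda df: df["Population"] / df["Area"]
)

print(cities_with_density)
```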


## Other Functions

In [46]:
cities_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Cities      3 non-null      object 
 1   Population  3 non-null      int64  
 2   Area        3 non-null      int64  
 3   density     3 non-null      float64
dtypes: float64(1), int64(2), object(1)
memory usage: 224.0+ bytes


In [47]:
cities_df.describe()

          Population          Area    density
count            3.0           3.0        3.0
mean    4559233000.0   783395400.0   21.02302
std     6573988000.0  1171976000.0  26.150054
min      232143400.0     4532523.0   5.688762
25%      776787300.0   109472400.0     5.9259
50%     1321431000.0   214412300.0   6.163038
75%     6722778000.0  1172827000.0  28.690149
max    12124120000.0  2131241000.0  51.217261


In [93]:
cities_df.isnull()

   Cities  Population   Area  density
0   False       False  False    False
1   False       False  False    False
2   False       False  False    False


In [94]:
cities_df.isnull().sum()

Cities        0
Population    0
Area          0
density       0
dtype: int64
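As a minimal sketch of how isnull() is typically combined with other calls when a value really is missing, the example below uses a small hypothetical frame. The rating column and its values are made up for illustration and are not part of the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last rating is deliberately missing
demo_df = pd.DataFrame({
    "Cities": ["Karachi", "Islamabad", "Lahore"],
    "rating": [4.5, 3.8, np.nan],
})

# Count the missing values in each column
missing_per_column = demo_df.isnull().sum()

# Replace missing ratings with a placeholder value
filled = demo_df.fillna({"rating": 0.0})

# Or drop every row that contains a missing value
dropped = demo_df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Whether filling or dropping is appropriate depends on the data; both return new objects and leave demo_df unchanged.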

### Applying Custom Functions

In [100]:
# The function below checks a population value and returns "Low", "Normal", or "High" accordingly.

def population_range(num):
    if num < 1000000:
        return "Low"
    elif num < 9999999999:
        return "Normal"
    else:
        return "High"

In [58]:
population = cities_df['Population']

In [59]:
# passing function to apply and storing returned series in new_col

new_col = population.apply(population_range)

In [60]:
cities_df['Population Category'] = new_col

In [61]:
cities_df

      Cities   Population        Area    density Population Category
0    Karachi  12124124314  2131241241   5.688762                High
1  Islamabad    232143412     4532523  51.217261              Normal
2     Lahore   1321431241   214412312   6.163038              Normal
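apply calls the Python function once per element, which is flexible but can be slow on large data. For simple threshold-based labels, pd.cut is one possible vectorized alternative. The sketch below is not the notebook's method; it assumes the same thresholds as population_range, and categories is an illustrative name.

```python
import numpy as np
import pandas as pd

population = pd.Series([12124124314, 232143412, 1321431241])

# Left-closed bins (right=False) mirror the thresholds in population_range:
# [-inf, 1000000) -> Low, [1000000, 9999999999) -> Normal, [9999999999, inf) -> High
categories = pd.cut(
    population,
    bins=[-np.inf, 1000000, 9999999999, np.inf],
    labels=["Low", "Normal", "High"],
    right=False,
)

print(categories.tolist())
```

Note that pd.cut returns a categorical Series rather than plain strings, which may or may not suit later processing.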
