![pandas.png](media/pandas.png)

---

## Python Pandas For Absolute Beginners






### What is Pandas?
- Pandas is an open-source Python library for data analysis, widely used in data science and data analytics.
- It provides functions and methods for exploratory analysis and data manipulation.




### Why Learn and Use Pandas?
1. Pandas allows data scientists to import, analyze and explore data.
2. Pandas is used for data pre-processing, especially data cleaning.
3. Pandas provides summary statistics that help data scientists draw inferences from data.
4. Pandas is easy to learn.




[Pandas Official Website](https://pandas.pydata.org/)

### Install Pandas


`pip install pandas`




In [3]:
# Importing Pandas
import pandas as pd

In [2]:
# Check version of Pandas
print(pd.__version__)


2.3.3


# Series and DataFrame

## What is a Series?
- A Pandas Series is a one-dimensional labeled array that can hold data of any type.

- It is like a single column in a table.

## What is a DataFrame?

- When a dataset is multi-dimensional, it is stored in a structure called a DataFrame.

- If a Series is like a column, then a DataFrame is the whole table.

![SeriesPandas.png](media/SeriesPandas.png)
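
The column-versus-table relationship can be checked directly. The sketch below is only an illustration: the `table` DataFrame and its column names are made up, not part of this tutorial's data.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration.
table = pd.DataFrame({"height": [1.2, 1.5, 1.1], "weight": [30, 42, 28]})

# Selecting a single column gives back a Series.
column = table["height"]

print(type(table).__name__)   # DataFrame
print(type(column).__name__)  # Series

# A DataFrame has two dimensions (rows and columns), a Series has one.
print(table.ndim, column.ndim)  # 2 1
```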

## Series


In [4]:
# Create Pandas Series from a Python List
data = [1, 2, 3, 4]
data

[1, 2, 3, 4]

In [5]:
type(data)

list

In [12]:
x = pd.Series(data)
x

0    1
1    2
2    3
3    4
dtype: int64

In [11]:
type(x)

pandas.core.series.Series

In [13]:
# Create Pandas Series with Index
x = pd.Series(data, index=['Mon', 'Tue', 'Wed', 'Thur'])
x

Mon     1
Tue     2
Wed     3
Thur    4
dtype: int64
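
Once a Series has index labels, values can be looked up by label as well as by position. A minimal sketch, rebuilding the same Series under the hypothetical name `days`:

```python
import pandas as pd

days = pd.Series([1, 2, 3, 4], index=["Mon", "Tue", "Wed", "Thur"])

by_label = days["Tue"]          # look up by index label
by_position = days.iloc[1]      # look up by integer position
subset = days[["Mon", "Wed"]]   # a list of labels returns a smaller Series

print(by_label, by_position)    # 2 2
print(list(subset))             # [1, 3]
```

Using `.iloc` for positions keeps the two kinds of lookup clearly separate.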

In [16]:
# Create Pandas Series from a Python Dictionary
data = {
    'name' : 'Davis',
    'email' : 'opokudavis141@gmail.com',
    'age' : 20
}

In [17]:
data

{'name': 'Davis', 'email': 'opokudavis141@gmail.com', 'age': 20}

In [18]:
y = pd.Series(data)

In [19]:
y

name                       Davis
email    opokudavis141@gmail.com
age                           20
dtype: object
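
When a Series is built from a dictionary, the keys become the index labels. Passing `index=` as well picks out only the listed keys, and a label that is not in the dictionary comes back as a missing value. The `stock` dictionary below is made up for this sketch.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
stock = {"apples": 3, "pears": 5, "plums": 7}

# Keep only the keys named in index=.
picked = pd.Series(stock, index=["apples", "plums"])

# "kiwis" is not a key in the dictionary, so its value is NaN.
with_missing = pd.Series(stock, index=["apples", "kiwis"])

print(list(picked.index))             # ['apples', 'plums']
print(with_missing.isna()["kiwis"])   # True
```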

##  DataFrame



In [20]:
# Create Pandas DataFrame
data = [1, 2, 3, 4]

In [21]:
data

[1, 2, 3, 4]

In [22]:
z = pd.DataFrame(data)

In [23]:
z

Unnamed: 0,0
0,1
1,2
2,3
3,4


In [24]:
type(z)

pandas.core.frame.DataFrame

In [25]:
data = [[10, 20, 30], [22, 11, 77]]

In [31]:
z = pd.DataFrame(data, columns=['Jan', 'Feb', 'Mar'], index=['Week1', 'Week2'])

In [34]:
# Add a row with the loc indexer (loc can also select rows by index label)
z = pd.DataFrame(data, columns=['Jan', 'Feb', 'Mar'])

In [35]:
z

Unnamed: 0,Jan,Feb,Mar
0,10,20,30
1,22,11,77


In [36]:
z.loc[2] = [34, 55, 76]

In [37]:
z

Unnamed: 0,Jan,Feb,Mar
0,10,20,30
1,22,11,77
2,34,55,76
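
The same `loc` indexer works with text labels. The sketch below assumes the labeled index `Week1`, `Week2` from the earlier cell and stores the frame under the hypothetical name `sales`.

```python
import pandas as pd

sales = pd.DataFrame(
    [[10, 20, 30], [22, 11, 77]],
    columns=["Jan", "Feb", "Mar"],
    index=["Week1", "Week2"],
)

# Assigning to a label that does not exist yet appends a new row.
sales.loc["Week3"] = [34, 55, 76]

week1 = sales.loc["Week1"]        # one row, returned as a Series
cell = sales.loc["Week2", "Feb"]  # one value: row label, then column label

print(sales.shape)  # (3, 3)
print(cell)         # 11
```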


In [38]:
# Create Pandas DataFrame from a Python Dict
data = {
    'Monday' : [10, 20, 30],
    'Tuesday' : [100, 200, 300],
    'Wednesday' : [33, 44, 77]
}

In [39]:
data

{'Monday': [10, 20, 30], 'Tuesday': [100, 200, 300], 'Wednesday': [33, 44, 77]}

In [42]:
y = pd.DataFrame(data)
y

Unnamed: 0,Monday,Tuesday,Wednesday
0,10,100,33
1,20,200,44
2,30,300,77


In [47]:
# Using the loc attribute to return one or more specified row(s)
y.loc[[2, 0]]

Unnamed: 0,Monday,Tuesday,Wednesday
2,30,300,77
0,10,100,33


In [48]:
y.Monday  # or y['Monday']

0    10
1    20
2    30
Name: Monday, dtype: int64

In [49]:
y[['Monday', 'Tuesday']]

Unnamed: 0,Monday,Tuesday
0,10,100
1,20,200
2,30,300
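
The difference between single and double brackets is worth seeing side by side: one column name returns a Series, while a list of names returns a DataFrame. A short sketch using the same data under the hypothetical name `week`:

```python
import pandas as pd

week = pd.DataFrame({
    "Monday": [10, 20, 30],
    "Tuesday": [100, 200, 300],
    "Wednesday": [33, 44, 77],
})

one_column = week["Monday"]               # Series
same_column = week.Monday                 # attribute access, same result
two_columns = week[["Monday", "Tuesday"]] # DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer habit.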


### Datasets and Data Sources
- Kaggle
- UCI Machine Learning Repository
- Experimental trials




#### Iris Dataset
[Iris Dataset from Kaggle](https://www.kaggle.com/uciml/iris?select=Iris.csv)

[Iris Dataset from UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)

![alt text](media/iris.png)

In [50]:
# Import from CSV file
dataset = pd.read_csv('Iris.csv')

In [51]:
# View Data
dataset

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [53]:
# Check Head --- Returns the first 5 rows of the DataFrame
dataset.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [54]:
# Check Tail --- Returns the last 5 rows of the DataFrame
dataset.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


In [55]:
# Check Shape --- Returns a tuple showing the number of rows and columns
dataset.shape

(150, 6)

In [56]:
type(dataset.shape)

tuple
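
`head()` and `tail()` also accept the number of rows to return, and because `shape` is a tuple it can be unpacked into two variables. The sketch below uses a small made-up DataFrame named `df` so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical 10-row frame, used only for illustration.
df = pd.DataFrame({
    "Id": range(1, 11),
    "value": [x * 0.5 for x in range(10)],
})

first_three = df.head(3)   # first 3 rows instead of the default 5
last_two = df.tail(2)      # last 2 rows

n_rows, n_cols = df.shape  # unpack the (rows, columns) tuple

print(n_rows, n_cols)          # 10 2
print(list(last_two["Id"]))    # [9, 10]
```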

In [57]:
# Check Info --- Returns basic information on the DataFrame
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


## Pandas - Data Pre-Processing and Cleaning


Data cleaning means fixing errors or bad data in your dataset. It is a pre-processing step that should be carried out before the dataset is used for analysis.

Bad data can be any combination of:

- Empty cells or null values
- Data in wrong format
- Wrong data
- Duplicates
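
Before cleaning, it helps to measure how much of each problem is present. The sketch below shows one possible way to do that on a tiny made-up frame named `raw`; the values are invented and only the column names echo the Iris data.

```python
import pandas as pd

# Hypothetical data with one missing value and one repeated row.
raw = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "PetalLengthCm": [1.4, 1.4, 1.4, None],
    "Species": ["Iris-setosa"] * 4,
})

missing_per_column = raw.isna().sum()    # count of null values in each column
duplicate_rows = raw.duplicated().sum()  # rows identical to an earlier row
column_types = raw.dtypes                # a wrong format often shows up as a wrong dtype

print(missing_per_column["PetalLengthCm"])  # 1
print(duplicate_rows)                       # 1
```

Wrong values (for example a petal length of 200 cm) are not caught by any of these checks and usually need domain knowledge or a look at the summary statistics.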


In [67]:
data = pd.read_csv('Iris_modified.csv')

In [68]:
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,,Iris-setosa
1,2,4.9,3.0,1.4,,Iris-setosa
2,2,4.9,3.0,1.4,,Iris-setosa
3,3,4.7,3.2,,0.2,Iris-setosa
4,4,4.6,3.1,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.0,5.2,2.3,Iris-virginica
150,147,6.3,,5.0,1.9,Iris-virginica
151,148,6.5,3.0,5.2,2.0,Iris-virginica
152,149,6.2,3.4,5.4,2.3,Iris-virginica




In [60]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             154 non-null    int64  
 1   SepalLengthCm  153 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  152 non-null    float64
 4   PetalWidthCm   147 non-null    float64
 5   Species        154 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.3+ KB


In [71]:
# Remove Null Values -- dropna()
x = data.dropna()
x

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
4,4,4.6,3.1,1.5,0.2,Iris-setosa
5,5,5.0,3.6,1.4,0.2,Iris-setosa
6,6,5.4,3.9,1.7,0.4,Iris-setosa
8,8,5.0,3.4,1.5,0.2,Iris-setosa
9,9,4.4,2.9,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
147,144,6.8,3.2,5.9,2.3,Iris-virginica
148,145,6.7,3.3,5.7,2.5,Iris-virginica
149,146,6.7,3.0,5.2,2.3,Iris-virginica
151,148,6.5,3.0,5.2,2.0,Iris-virginica
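
Two details of `dropna()` are easy to miss: it returns a new DataFrame and leaves the original untouched, and its `subset` parameter limits the check to chosen columns. A minimal sketch on a made-up frame named `df`:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, None, 5.2],
    "PetalWidthCm": [None, 0.2, 2.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

no_missing = df.dropna()                         # drop rows with a null in any column
only_length = df.dropna(subset=["PetalLengthCm"])  # only check PetalLengthCm

print(list(no_missing.index))   # [2]
print(list(only_length.index))  # [0, 2]
print(len(df))                  # 3, the original is unchanged
```

The surviving rows keep their original index labels, which is why the index in the output above has gaps.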


In [73]:
# Replace Null Values -- fillna()
y = data.fillna(200)

In [74]:
y

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,200.0,Iris-setosa
1,2,4.9,3.0,1.4,200.0,Iris-setosa
2,2,4.9,3.0,1.4,200.0,Iris-setosa
3,3,4.7,3.2,200.0,0.2,Iris-setosa
4,4,4.6,3.1,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.0,5.2,2.3,Iris-virginica
150,147,6.3,200.0,5.0,1.9,Iris-virginica
151,148,6.5,3.0,5.2,2.0,Iris-virginica
152,149,6.2,3.4,5.4,2.3,Iris-virginica


In [75]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             154 non-null    int64  
 1   SepalLengthCm  154 non-null    float64
 2   SepalWidthCm   154 non-null    float64
 3   PetalLengthCm  154 non-null    float64
 4   PetalWidthCm   154 non-null    float64
 5   Species        154 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.3+ KB


In [77]:
data.PetalWidthCm

0      NaN
1      NaN
2      NaN
3      0.2
4      0.2
      ... 
149    2.3
150    1.9
151    2.0
152    2.3
153    1.8
Name: PetalWidthCm, Length: 154, dtype: float64

In [87]:
# Replace Null Values for Specific Columns-- fillna()
data['PetalWidthCm'] = data['PetalWidthCm'].fillna(700)
data['PetalLengthCm'] = data['PetalLengthCm'].fillna(200)
data['SepalWidthCm'] = data['SepalWidthCm'].fillna(500)

In [88]:
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,700.0,Iris-setosa
1,2,4.9,3.0,1.4,700.0,Iris-setosa
2,2,4.9,3.0,1.4,700.0,Iris-setosa
3,3,4.7,3.2,200.0,0.2,Iris-setosa
4,4,4.6,3.1,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.0,5.2,2.3,Iris-virginica
150,147,6.3,500.0,5.0,1.9,Iris-virginica
151,148,6.5,3.0,5.2,2.0,Iris-virginica
152,149,6.2,3.4,5.4,2.3,Iris-virginica


In [7]:
# Replace Null Values Using Mean, Median or Mode -- fillna()
data = pd.read_csv('Iris_modified.csv')

In [8]:
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,,Iris-setosa
1,2,4.9,3.0,1.4,,Iris-setosa
2,2,4.9,3.0,1.4,,Iris-setosa
3,3,4.7,3.2,,0.2,Iris-setosa
4,4,4.6,3.1,1.5,0.2,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.0,5.2,2.3,Iris-virginica
150,147,6.3,,5.0,1.9,Iris-virginica
151,148,6.5,3.0,5.2,2.0,Iris-virginica
152,149,6.2,3.4,5.4,2.3,Iris-virginica


In [9]:
data.PetalLengthCm

0      1.4
1      1.4
2      1.4
3      NaN
4      1.5
      ... 
149    5.2
150    5.0
151    5.2
152    5.4
153    NaN
Name: PetalLengthCm, Length: 154, dtype: float64

In [10]:
mean_PL = data.PetalLengthCm.mean()
mean_PW = data.PetalWidthCm.mean()
mean_SL = data.SepalLengthCm.mean()
mean_SW = data.SepalWidthCm.mean()


In [11]:
print(f'PetalLengthCm: {mean_PL}')
print(f'PetalWidthCm: {mean_PW}')
print(f'SepalLengthCm: {mean_SL}')
print(f'SepalWidthCm: {mean_SW}')

PetalLengthCm: 3.746052631578947
PetalWidthCm: 1.195918367346939
SepalLengthCm: 5.827450980392157
SepalWidthCm: 3.0433333333333334


In [12]:
data['PetalWidthCm'] = data['PetalWidthCm'].fillna(mean_PW)
data['PetalLengthCm'] = data['PetalLengthCm'].fillna(mean_PL)
data['SepalWidthCm'] = data['SepalWidthCm'].fillna(mean_SW)
data['SepalLengthCm'] = data['SepalLengthCm'].fillna(mean_SL)

In [13]:
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.500000,1.400000,1.195918,Iris-setosa
1,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
2,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
3,3,4.7,3.200000,3.746053,0.200000,Iris-setosa
4,4,4.6,3.100000,1.500000,0.200000,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.000000,5.200000,2.300000,Iris-virginica
150,147,6.3,3.043333,5.000000,1.900000,Iris-virginica
151,148,6.5,3.000000,5.200000,2.000000,Iris-virginica
152,149,6.2,3.400000,5.400000,2.300000,Iris-virginica


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             154 non-null    int64  
 1   SepalLengthCm  154 non-null    float64
 2   SepalWidthCm   154 non-null    float64
 3   PetalLengthCm  154 non-null    float64
 4   PetalWidthCm   154 non-null    float64
 5   Species        154 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.3+ KB
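
Filling each column one at a time works, but the same result can be reached in one step by passing a Series of column means to `fillna()`. This is an alternative sketch rather than the approach used above, and the small `df` frame is made up.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "PetalWidthCm": [1.0, None, 3.0],
    "PetalLengthCm": [2.0, 4.0, None],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Mean of every numeric column; missing values are skipped by default.
means = df.mean(numeric_only=True)

# Each null is replaced by the mean of its own column.
filled = df.fillna(means)

print(filled.loc[1, "PetalWidthCm"])   # 2.0
print(filled.loc[2, "PetalLengthCm"])  # 3.0
```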


### Exercise

#### Use the median and mode of each column to replace that column's missing values

In [None]:
#mode(), median(), mean()
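
As a starting point for the exercise, here is how the two methods behave on a toy Series (the values are made up). Note that `mode()` returns a Series, because several values can tie for most frequent, so one value has to be picked from it.

```python
import pandas as pd

# Hypothetical values with one missing entry at position 4.
s = pd.Series([1.0, 1.0, 2.0, 4.0, None, 10.0])

median_value = s.median()      # middle of the sorted non-null values
mode_value = s.mode().iloc[0]  # first of the most frequent values

median_filled = s.fillna(median_value)
mode_filled = s.fillna(mode_value)

print(median_filled[4])  # 2.0
print(mode_filled[4])    # 1.0
```

Applying the same pattern to the Iris columns is left as the exercise.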

In [106]:
# Remove Duplicates
# duplicated() flags rows that repeat an earlier row
# drop_duplicates() returns a copy without those rows
data


Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.500000,1.400000,1.195918,Iris-setosa
1,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
2,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
3,3,4.7,3.200000,3.746053,0.200000,Iris-setosa
4,4,4.6,3.100000,1.500000,0.200000,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.000000,5.200000,2.300000,Iris-virginica
150,147,6.3,3.043333,5.000000,1.900000,Iris-virginica
151,148,6.5,3.000000,5.200000,2.000000,Iris-virginica
152,149,6.2,3.400000,5.400000,2.300000,Iris-virginica


In [107]:
data.duplicated()

0      False
1      False
2       True
3      False
4      False
       ...  
149    False
150    False
151    False
152    False
153    False
Length: 154, dtype: bool

In [15]:
data.drop_duplicates()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.500000,1.400000,1.195918,Iris-setosa
1,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
3,3,4.7,3.200000,3.746053,0.200000,Iris-setosa
4,4,4.6,3.100000,1.500000,0.200000,Iris-setosa
5,5,5.0,3.600000,1.400000,0.200000,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.000000,5.200000,2.300000,Iris-virginica
150,147,6.3,3.043333,5.000000,1.900000,Iris-virginica
151,148,6.5,3.000000,5.200000,2.000000,Iris-virginica
152,149,6.2,3.400000,5.400000,2.300000,Iris-virginica
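
Like `dropna()`, `drop_duplicates()` returns a new DataFrame, so the result needs to be assigned to a name to be kept. The sketch below uses a small made-up frame and also resets the index so that the row labels are consecutive again.

```python
import pandas as pd

# Hypothetical data where the third row repeats the second.
df = pd.DataFrame({
    "Id": [1, 2, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 4.9, 4.7],
})

flags = df.duplicated()  # True for rows that repeat an earlier row

# Keep the first occurrence of each row, then renumber from 0.
deduped = df.drop_duplicates().reset_index(drop=True)

print(list(flags))          # [False, False, True, False]
print(list(deduped.index))  # [0, 1, 2]
print(len(df))              # 4, the original is unchanged
```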


## Pandas - Basic Data Analysis


Data analysis simply means gaining insight from data. It involves using tools such as Python, R, Excel and SQL, and libraries such as Pandas and NumPy, to understand data.

In this section, we will look at the following techniques in Data Analysis:

- Filtering
- Sorting
- Data Correlation



In [16]:
# Filtering
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.500000,1.400000,1.195918,Iris-setosa
1,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
2,2,4.9,3.000000,1.400000,1.195918,Iris-setosa
3,3,4.7,3.200000,3.746053,0.200000,Iris-setosa
4,4,4.6,3.100000,1.500000,0.200000,Iris-setosa
...,...,...,...,...,...,...
149,146,6.7,3.000000,5.200000,2.300000,Iris-virginica
150,147,6.3,3.043333,5.000000,1.900000,Iris-virginica
151,148,6.5,3.000000,5.200000,2.000000,Iris-virginica
152,149,6.2,3.400000,5.400000,2.300000,Iris-virginica
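
Filtering is usually done with a boolean mask: a comparison on a column produces a Series of True/False values, and indexing the DataFrame with that mask keeps only the True rows. The sketch below is one way to do it, on a made-up frame named `flowers` that borrows the Iris column names.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
flowers = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.7, 6.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# One condition.
long_sepals = flowers[flowers["SepalLengthCm"] > 5.0]

# Two conditions: combine with & (and) or | (or), each wrapped in parentheses.
setosa_long = flowers[
    (flowers["Species"] == "Iris-setosa") & (flowers["SepalLengthCm"] > 5.0)
]

print(len(long_sepals))          # 3
print(list(setosa_long.index))   # [0]
```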


In [None]:
# Sorting
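
A possible starting point for sorting is `sort_values()`, which orders rows by one or more columns and returns a new DataFrame. The `flowers` frame below is made up for this sketch.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
flowers = pd.DataFrame({
    "SepalLengthCm": [6.3, 5.1, 6.7, 4.9],
    "Species": ["Iris-virginica", "Iris-setosa", "Iris-virginica", "Iris-setosa"],
})

# Ascending order by a single column (the default).
by_length = flowers.sort_values("SepalLengthCm")

# Species A to Z, and within each species the longest sepal first.
by_species = flowers.sort_values(
    ["Species", "SepalLengthCm"], ascending=[True, False]
)

print(list(by_length.index))   # [3, 1, 0, 2]
print(list(by_species.index))  # [1, 3, 2, 0]
```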


In [None]:
# Data Correlation
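
For correlation, `DataFrame.corr()` computes pairwise correlation between columns (Pearson by default). The sketch below uses invented, perfectly linear values so the result is easy to read; real Iris measurements will give values between -1 and 1. Passing `numeric_only=True` leaves out text columns such as `Species`.

```python
import pandas as pd

# Hypothetical, perfectly linear data, used only for illustration.
measurements = pd.DataFrame({
    "PetalLengthCm": [1.0, 2.0, 3.0, 4.0],
    "PetalWidthCm": [2.0, 4.0, 6.0, 8.0],   # rises with PetalLengthCm
    "SepalWidthCm": [4.0, 3.0, 2.0, 1.0],   # falls as PetalLengthCm rises
    "Species": ["a", "a", "b", "b"],
})

corr = measurements.corr(numeric_only=True)

# Close to 1: the two columns rise together.
print(round(corr.loc["PetalLengthCm", "PetalWidthCm"], 6))
# Close to -1: one column falls as the other rises.
print(round(corr.loc["PetalLengthCm", "SepalWidthCm"], 6))
```

A value near 0 would mean there is no linear relationship between the two columns.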
