## Pandas: DataFrame and Series

Pandas is a Python library for data manipulation, widely used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame. A Series is a one-dimensional, labeled, array-like object, while a DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
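
To make the one-dimensional versus two-dimensional distinction concrete, here is a minimal sketch. The names `s` and `frame` and the sample values are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# A Series is one-dimensional: a single labeled column of values.
s = pd.Series([1.5, 2.5, 3.5], name="score")

# A DataFrame is two-dimensional and can mix column types.
frame = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, 2.5, 3.5]})

print(s.ndim, s.shape)          # 1 (3,)
print(frame.ndim, frame.shape)  # 2 (3, 2)

# Selecting one column of a DataFrame yields a Series.
print(type(frame["score"]).__name__)  # Series
```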

In [13]:
!pip install numpy
!pip install pandas



In [1]:
import pandas as pd
import numpy as np

In [2]:
## Series
## A pandas Series is a one-dimensional, array-like object that can hold any data type. It is similar to a column in a table.

data = [1,2,3,4,5]
series = pd.Series(data)
print("Series \n", series)
print(type(series))

Series 
 0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>


In [3]:
data = {'a':1, 'b':2, 'c':3}
series_dict = pd.Series(data)
print(series_dict)

a    1
b    2
c    3
dtype: int64


In [4]:
data = [10,20,30]
index = ['a','b','c']
pd.Series(data, index=index)

a    10
b    20
c    30
dtype: int64
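
A custom index matters because lookups and arithmetic then work by label rather than by position. The sketch below illustrates this; `prices`, `other`, and `total` are hypothetical names chosen for the example.

```python
import pandas as pd

prices = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Look up a value by its label rather than by its position.
print(prices["b"])  # 20

# Arithmetic aligns on labels; a label missing from one side gives NaN.
other = pd.Series([1, 2], index=["a", "b"])
total = prices + other
print(total)
```

Because `"c"` has no counterpart in `other`, its entry in `total` is `NaN`, and the result is stored as floats.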

In [5]:
## DataFrame
## Create a DataFrame from a dictionary of lists
data = {
    'Name': ['Krish', 'John', 'Jack'],
    'Age': [25,30,45],
    'City': ['Bangcock', 'New York', 'Ohio']
}

df = pd.DataFrame(data=data)
print(df)

    Name  Age      City
0  Krish   25  Bangcock
1   John   30  New York
2   Jack   45      Ohio


In [6]:
np.array(df)

array([['Krish', 25, 'Bangcock'],
       ['John', 30, 'New York'],
       ['Jack', 45, 'Ohio']], dtype=object)
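
The `dtype=object` in the array above comes from mixing text and numbers in a single NumPy array. As a hedged sketch, `DataFrame.to_numpy()` gives the same conversion, and selecting only numeric columns first keeps a numeric dtype. The frame `df_demo` is an illustrative stand-in.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Krish", "John"], "Age": [25, 30]})

# Mixed column types collapse to a single object-dtype array.
mixed = df_demo.to_numpy()
print(mixed.dtype)  # object

# Selecting only the numeric column keeps an integer dtype.
ages = df_demo[["Age"]].to_numpy()
print(ages.dtype, ages.shape)
```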

In [7]:
# Create a DataFrame from a list of dictionaries

data = [
    {'Name': 'Thang',
    'Age': 23,
    'City': 'Ha Noi'},
    {'Name': 'Minh',
    'Age': 24,
    'City': 'Ha Tay'},
    {'Name': 'Chau',
    'Age': 21,
    'City': 'Ha Noi'},
]
df = pd.DataFrame(data=data)
print(df)
print(type(df))

    Name  Age    City
0  Thang   23  Ha Noi
1   Minh   24  Ha Tay
2   Chau   21  Ha Noi
<class 'pandas.core.frame.DataFrame'>


In [8]:
# Read data from a CSV file
pd.read_csv('sample_data.csv')

                Name  Age     Sex Coffe, Bread and Wine
0    Thang, Duy Phan   23    Male                   Yes
1   Minh, Nguyen Bao   22    Male                    No
2  Long, Nguyen Ngoc   23  Female                   Yes


In [9]:
df = pd.read_csv('sample_data.csv')
print(df.head(2))
print("--------")
print(df.tail(1))

               Name  Age   Sex Coffe, Bread and Wine
0   Thang, Duy Phan   23  Male                   Yes
1  Minh, Nguyen Bao   22  Male                    No
--------
                Name  Age     Sex Coffe, Bread and Wine
2  Long, Nguyen Ngoc   23  Female                   Yes
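
The contents of `sample_data.csv` are not shown in this notebook. The sketch below therefore assumes a file with quoted name fields and reads it from an in-memory string, so that it runs without any file on disk. `csv_text` and `people` are hypothetical names.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the quotes keep embedded commas in one field.
csv_text = '''Name,Age,Sex
"Thang, Duy Phan",23,Male
"Minh, Nguyen Bao",22,Male
"Long, Nguyen Ngoc",23,Female
'''

people = pd.read_csv(io.StringIO(csv_text))
print(people.shape)    # (3, 3)
print(people.head(2))  # first two rows
print(people.tail(1))  # last row
```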


In [10]:
print(df['Name'])

0      Thang, Duy Phan
1     Minh, Nguyen Bao
2    Long, Nguyen Ngoc
Name: Name, dtype: object


In [None]:
print(df.loc[1])  # .loc selects a row by its index label

Name                     Minh, Nguyen Bao
Age                                    22
Sex                                  Male
Coffe, Bread and Wine                  No
Name: 1, dtype: object


In [12]:
print(df.iloc[2])  # .iloc selects a row by its integer position

Name                     Long, Nguyen Ngoc
Age                                     23
Sex                                 Female
Coffe, Bread and Wine                  Yes
Name: 2, dtype: object
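
With the default index (0, 1, 2, ...), labels and positions coincide, so `.loc` and `.iloc` appear interchangeable. The sketch below uses a hypothetical frame `staff` whose index labels are deliberately not 0, 1, 2, which makes the difference visible.

```python
import pandas as pd

# Index labels deliberately differ from the positions 0, 1, 2.
staff = pd.DataFrame(
    {"Name": ["Thang", "Minh", "Long"], "Age": [23, 22, 23]},
    index=[10, 20, 30],
)

by_label = staff.loc[20]     # the row whose index label is 20
by_position = staff.iloc[2]  # the third row, whatever its label

print(by_label["Name"])     # Minh
print(by_position["Name"])  # Long
```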


In [17]:
df.iloc[2, 0]

'Long, Nguyen Ngoc'

In [None]:
## Access a single element by row label and column label
df.at[0,'Age']

np.int64(23)

In [21]:
df.iat[2,2]

'Female'
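
`.at` and `.iat` are the scalar counterparts of `.loc` and `.iloc`, and both can write a cell as well as read it. This is a minimal sketch on an illustrative frame named `staff`.

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Thang", "Minh"], "Age": [23, 22]})

# .at takes a row label and a column label.
print(staff.at[1, "Age"])  # 22

# .iat takes a row position and a column position.
print(staff.iat[0, 0])     # Thang

# Both accessors can also assign a single cell.
staff.at[1, "Age"] = 30
print(staff.iat[1, 1])     # 30
```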

In [23]:
## Data Manipulation with DataFrame
df

                Name  Age     Sex Coffe, Bread and Wine
0    Thang, Duy Phan   23    Male                   Yes
1   Minh, Nguyen Bao   22    Male                    No
2  Long, Nguyen Ngoc   23  Female                   Yes


In [24]:
df['Salary'] = [5000,6000,7000]

In [25]:
df

                Name  Age     Sex Coffe, Bread and Wine  Salary
0    Thang, Duy Phan   23    Male                   Yes    5000
1   Minh, Nguyen Bao   22    Male                    No    6000
2  Long, Nguyen Ngoc   23  Female                   Yes    7000


In [26]:
## Remove a column
df.drop('Salary', axis=1)

                Name  Age     Sex Coffe, Bread and Wine
0    Thang, Duy Phan   23    Male                   Yes
1   Minh, Nguyen Bao   22    Male                    No
2  Long, Nguyen Ngoc   23  Female                   Yes


In [None]:
df
# df still has the Salary column: drop() returned a new DataFrame and left df unchanged

                Name  Age     Sex Coffe, Bread and Wine  Salary
0    Thang, Duy Phan   23    Male                   Yes    5000
1   Minh, Nguyen Bao   22    Male                    No    6000
2  Long, Nguyen Ngoc   23  Female                   Yes    7000


In [29]:
df.drop('Salary',axis=1, inplace=True)

In [30]:
df

                Name  Age     Sex Coffe, Bread and Wine
0    Thang, Duy Phan   23    Male                   Yes
1   Minh, Nguyen Bao   22    Male                    No
2  Long, Nguyen Ngoc   23  Female                   Yes
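
`drop()` returns a new DataFrame by default, so the result has to be kept somewhere. Reassigning it is a common alternative to `inplace=True`, as the sketch below shows. `staff` and `trimmed` are hypothetical names.

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Thang", "Minh"], "Salary": [5000, 6000]})

# drop() returns a new DataFrame and leaves the original untouched.
trimmed = staff.drop(columns="Salary")
print(list(staff.columns))    # ['Name', 'Salary']
print(list(trimmed.columns))  # ['Name']

# Reassigning the result has the same effect as inplace=True.
staff = staff.drop(columns="Salary")
print(list(staff.columns))    # ['Name']
```

Here `columns="Salary"` is equivalent to passing `'Salary'` with `axis=1`.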


In [31]:
## Add 1 to every value in the Age column
df["Age"] += 1
df

                Name  Age     Sex Coffe, Bread and Wine
0    Thang, Duy Phan   24    Male                   Yes
1   Minh, Nguyen Bao   23    Male                    No
2  Long, Nguyen Ngoc   24  Female                   Yes


In [32]:
## Drop the row with index label 1 (returns a new DataFrame; df is unchanged)
df.drop(1)

                Name  Age     Sex Coffe, Bread and Wine
0    Thang, Duy Phan   24    Male                   Yes
2  Long, Nguyen Ngoc   24  Female                   Yes
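
Dropping a row keeps the surviving index labels as they were, which leaves a gap (0, 2). If consecutive labels are wanted, `reset_index(drop=True)` renumbers them. This is a sketch on an illustrative frame named `staff`.

```python
import pandas as pd

staff = pd.DataFrame({"Name": ["Thang", "Minh", "Long"], "Age": [24, 23, 24]})

# Dropping the row labeled 1 leaves a gap in the index: 0, 2.
remaining = staff.drop(index=1)
print(list(remaining.index))   # [0, 2]

# reset_index(drop=True) renumbers rows from 0 and discards the old labels.
renumbered = remaining.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```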


In [35]:
# Display data types of each column
print("Data types:\n", df.dtypes)

# Describe the DataFrame
print("Statistical summary:\n", df.describe())

# Group by a column and perform an aggregation
grouped = df.groupby('Sex')['Age'].mean()
print("Mean age by sex:\n", grouped)

Data types:
 Name                     object
Age                       int64
Sex                      object
Coffe, Bread and Wine    object
dtype: object
Statistical summary:
              Age
count   3.000000
mean   23.666667
std     0.577350
min    23.000000
25%    23.500000
50%    24.000000
75%    24.000000
max    24.000000
Mean age by sex:
 Sex
Female    24.0
Male      23.5
Name: Age, dtype: float64
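
`groupby` raises a `KeyError` when the column it is given does not exist, so the grouping column has to be one the frame actually has. The sketch below computes several aggregations per group in one call with `agg`. The frame `staff` is a hypothetical stand-in that mirrors the `Sex` and `Age` columns used in this notebook.

```python
import pandas as pd

staff = pd.DataFrame(
    {
        "Sex": ["Male", "Male", "Female"],
        "Age": [24, 23, 24],
    }
)

# Several aggregations of one column, computed per group.
summary = staff.groupby("Sex")["Age"].agg(["mean", "min", "max", "count"])
print(summary)
```

The result is a DataFrame indexed by the group keys, with one column per aggregation.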