## Pandas

* Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame.
* DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
* As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
* A DataFrame can be thought of as a dictionary of Series objects that all share the same index.

### Installation

```bash
pip install pandas
```

### Importing Pandas

In [1]:
import pandas as pd

The two main data structures in pandas are Series and DataFrame.

1. **Series:**
A Series is a one-dimensional labeled array capable of holding any data type. It can be created from a list, array, or dictionary.

In [2]:
# Create a Series from a list
series = pd.Series([1, 2, 3, 4, 5])

# Create a Series from a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
series = pd.Series(data)

In [3]:
print(data)

{'a': 10, 'b': 20, 'c': 30}


In [4]:
print(series)

a    10
b    20
c    30
dtype: int64
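
The text above mentions that a Series can also be built from an array, which the cells do not show. The following is a minimal sketch of that case; the variable names and the index labels `"x"`, `"y"`, `"z"` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Build a Series from a NumPy array and attach custom index labels
values = np.array([1.5, 2.5, 3.5])
labeled = pd.Series(values, index=["x", "y", "z"])

# Access an element by its label and by its integer position
by_label = labeled["y"]
by_position = labeled.iloc[0]

print(labeled)
print(by_label, by_position)
```

Without the `index` argument, pandas assigns the default integer labels 0, 1, 2, as in the list example.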


2. **DataFrame:**
A DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns). It can be created from a dictionary, list of dictionaries, or a 2D array.

In [5]:
# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

In [6]:
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 22 | Chicago |
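
The description of DataFrame also lists a list of dictionaries and a 2D array as possible inputs. A short sketch of both follows; the sample records and the column names `"A"` and `"B"` are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_records = pd.DataFrame(records)

# From a 2D NumPy array: columns default to 0, 1, ... unless names are given
array = np.array([[1, 2], [3, 4], [5, 6]])
df_array = pd.DataFrame(array, columns=["A", "B"])

print(df_records)
print(df_array)
```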


### Exploring Data

Once a DataFrame exists, several methods and attributes help explore its structure and contents.

In [7]:
# Display the first few rows of the DataFrame
print(df.head())  # Default is the first 5 rows

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago


In [8]:
# Get a summary of the DataFrame
df.info()  # Prints data types and non-null counts itself; it returns None, so print() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Age     3 non-null      int64
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 120.0+ bytes


In [9]:
# Display data types of columns
print(df.dtypes)

Name    object
Age      int64
City    object
dtype: object


In [10]:
# Display the last few rows
print(df.tail())  # Default is the last 5 rows; with only 3 rows the output matches head()

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago


In [11]:
# Get the shape of the DataFrame (number of rows and columns)
print(df.shape)  # Output: (number_of_rows, number_of_columns)

(3, 3)


In [12]:
# Get basic statistics for numerical columns
print(df.describe())

             Age
count   3.000000
mean   25.666667
std     4.041452
min    22.000000
25%    23.500000
50%    25.000000
75%    27.500000
max    30.000000


### Selecting Data

In [13]:
# Select a single column
column = df['Age']
column

0    25
1    30
2    22
Name: Age, dtype: int64

In [14]:
# Select multiple columns
subset = df[['Name', 'City']]
subset

|   | Name | City |
|---|------|------|
| 0 | Alice | New York |
| 1 | Bob | Los Angeles |
| 2 | Charlie | Chicago |


In [15]:
# Select rows based on a condition
filtered_df = df[df['Age'] > 25]
filtered_df

|   | Name | Age | City |
|---|------|-----|------|
| 1 | Bob | 30 | Los Angeles |
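
The selections above can be extended with combined conditions and with the `.loc` (label-based) and `.iloc` (position-based) indexers. The sketch below rebuilds the same sample DataFrame so that it runs on its own; the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
young_or_ny = df[(df["Age"] < 25) | (df["City"] == "New York")]

# .loc selects by label: rows where Age > 22, and only the Name column
names = df.loc[df["Age"] > 22, "Name"]

# .iloc selects by integer position: first row, second column
first_age = df.iloc[0, 1]

print(young_or_ny)
print(names)
print(first_age)
```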


### Adding and Deleting Columns

In [16]:
# Add a new column
df['Salary'] = [60000, 75000, 50000]
df

|   | Name | Age | City | Salary |
|---|------|-----|------|--------|
| 0 | Alice | 25 | New York | 60000 |
| 1 | Bob | 30 | Los Angeles | 75000 |
| 2 | Charlie | 22 | Chicago | 50000 |


In [17]:
# Delete a column
del df['Salary']
df

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 22 | Chicago |
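
Both operations above modify `df` in place. When the original DataFrame should stay untouched, `assign` and `drop` return new DataFrames instead. This is a sketch of that alternative, not something the original notebook does; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 22],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# assign() returns a copy with the extra column; df itself is unchanged
df_with_salary = df.assign(Salary=[60000, 75000, 50000])

# drop(columns=...) returns a copy without the listed columns
df_without_city = df_with_salary.drop(columns=["City"])

print(df.columns.tolist())
print(df_without_city.columns.tolist())
```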


### Sorting Data

In [18]:
sorted_df = df.sort_values(by='Age', ascending=False)
sorted_df

|   | Name | Age | City |
|---|------|-----|------|
| 1 | Bob | 30 | Los Angeles |
| 0 | Alice | 25 | New York |
| 2 | Charlie | 22 | Chicago |
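
`sort_values` also accepts a list of columns, with one sort direction per column. The sketch below uses a slightly larger, made-up DataFrame (the row for "Dana" is hypothetical) so that the second sort key has a visible effect.

```python
import pandas as pd

# Hypothetical data with repeated cities so the second sort key matters
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 22, 30],
    "City": ["New York", "Chicago", "Chicago", "New York"],
})

# Sort by City ascending, then by Age descending within each city
sorted_multi = people.sort_values(by=["City", "Age"], ascending=[True, False])

# The original index labels travel with the rows; reset them if a clean 0..n-1 index is wanted
sorted_reset = sorted_multi.reset_index(drop=True)

print(sorted_multi)
print(sorted_reset)
```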


### Grouping and Aggregating Data

In [19]:
# Compute the mean of a column
mean_age = df['Age'].mean()
mean_age

25.666666666666668

In [20]:
# Compute the maximum value of a column
max_age = df['Age'].max()
max_age

30

In [21]:
# Group by "City" and calculate the mean age
grouped = df.groupby("City")["Age"].mean()
print(grouped)

City
Chicago        22.0
Los Angeles    30.0
New York       25.0
Name: Age, dtype: float64
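
In the sample data every city appears only once, so each group mean equals the single value in it. As a sketch of grouping with more than one row per group and several aggregations at once, the following uses a small made-up DataFrame and the `agg` method.

```python
import pandas as pd

# Hypothetical data with two rows for Chicago
people = pd.DataFrame({
    "City": ["Chicago", "Chicago", "New York"],
    "Age": [22, 30, 25],
})

# Apply several aggregations to the Age column of each group
summary = people.groupby("City")["Age"].agg(["mean", "max", "count"])

print(summary)
```

The result is a DataFrame indexed by city, with one column per aggregation.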


### Merging and Joining

In [22]:
# DataFrames to merge
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})

df2 = pd.DataFrame({
    "ID": [2, 3, 4],
    "Age": [25, 30, 35],
})

**Inner join on "ID"**

An inner join returns only the rows whose keys appear in both DataFrames. In this example, IDs 2 and 3 exist in both `df1` and `df2`, so only those two rows are included in the merged DataFrame.

In [28]:
merged_inner = pd.merge(df1, df2, on="ID", how="inner")
print(merged_inner)

   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30


**Outer join on "ID"**

An outer join returns all rows from both DataFrames, filling in NaN where there is no match.
In this example, ID 1 has no "Age" and ID 4 has no "Name", so those cells are NaN. Because NaN is a floating-point value, the "Age" column is converted to float.

In [26]:
merged_outer = pd.merge(df1, df2, on="ID", how="outer")
print(merged_outer)

   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0
3   4      NaN  35.0


**Left join on "ID"**

A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Unmatched entries are filled with NaN.

In [30]:
merged_left = pd.merge(df1, df2, on="ID", how="left")
print(merged_left)

   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0


**Right join on "ID"**

A right join is the reverse of a left join: it returns all rows from the right DataFrame and the matching rows from the left DataFrame. Unmatched entries are filled with NaN.

In [31]:
merged_right = pd.merge(df1, df2, on="ID", how="right")
print(merged_right)

   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
2   4      NaN   35
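
To see where each row of a merged result came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below repeats the two DataFrames from above so that it runs on its own; the name `merged_flagged` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Age": [25, 30, 35]})

# indicator=True adds a "_merge" column: "left_only", "right_only", or "both"
merged_flagged = pd.merge(df1, df2, on="ID", how="outer", indicator=True)

print(merged_flagged)
```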


**Concatenation**

In [34]:
# Concatenate DataFrames vertically
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)

   ID     Name   Age
0   1    Alice   NaN
1   2      Bob   NaN
2   3  Charlie   NaN
0   2      NaN  25.0
1   3      NaN  30.0
2   4      NaN  35.0
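
Note that the index labels 0, 1, 2 repeat in the output above, because `concat` keeps each DataFrame's original index. A minimal sketch of the `ignore_index=True` option, which renumbers the rows, is shown here with the same two DataFrames.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Age": [25, 30, 35]})

# ignore_index=True discards the original labels and builds a fresh 0..n-1 index
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

print(stacked)
```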


### Handling Missing Data

In [35]:
# Check for missing values
print(merged_right.isnull())

      ID   Name    Age
0  False  False  False
1  False  False  False
2  False   True  False


In [36]:
# Drop rows with missing values
df_cleaned = merged_right.dropna()
df_cleaned

|   | ID | Name | Age |
|---|----|------|-----|
| 0 | 2 | Bob | 25 |
| 1 | 3 | Charlie | 30 |


In [38]:
# Fill missing values with a specific value (applied to every column that contains NaN)
df_filled = merged_right.fillna("Fulan")
df_filled

|   | ID | Name | Age |
|---|----|------|-----|
| 0 | 2 | Bob | 25 |
| 1 | 3 | Charlie | 30 |
| 2 | 4 | Fulan | 35 |
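
A single fill value is applied to every column, which is rarely what is wanted when text and numeric columns both have gaps. One option is to pass `fillna` a dictionary that maps column names to fill values. The sketch below uses a made-up DataFrame with gaps in two columns; the placeholder `"Unknown"` and the choice of the column mean are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing name and a missing age
people = pd.DataFrame({
    "ID": [2, 3, 4],
    "Name": ["Bob", "Charlie", np.nan],
    "Age": [25.0, np.nan, 35.0],
})

# Count missing values per column
missing_counts = people.isnull().sum()

# Fill each column with its own value: a placeholder for Name, the mean for Age
filled = people.fillna({"Name": "Unknown", "Age": people["Age"].mean()})

print(missing_counts)
print(filled)
```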


### Reading and Writing Data

In [39]:
# Read data from a CSV file
df_iris = pd.read_csv('dataset/iris.csv')
df_iris

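|   | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |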


In [40]:
# Write data to a CSV file
df_filled.to_csv('output.csv', index=False)
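
The effect of `index=False` is easiest to see in a round trip. The sketch below writes a small DataFrame to an in-memory buffer (`io.StringIO`, used here as a stand-in for a file on disk so that the example needs no external files) and reads it back.

```python
import io

import pandas as pd

people = pd.DataFrame({"ID": [2, 3, 4], "Name": ["Bob", "Charlie", "Fulan"]})

# Write CSV text to an in-memory buffer; index=False omits the row labels
buffer = io.StringIO()
people.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(csv_text)
print(restored)
```

If the index is written (the default), reading the file back produces an extra column named `Unnamed: 0` unless `index_col=0` is passed to `read_csv`.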