# Pandas Library Overview

## Description

This Jupyter Notebook is a comprehensive guide to the most commonly used features of the Pandas library. It includes detailed explanations and practical examples of various functions and methods used for data manipulation, analysis, and visualization. This notebook serves as a useful reference for anyone looking to understand and utilize the powerful capabilities of Pandas for data science and analytics.

## Table of Contents
``` markdown
1. [Introduction and Data Structures](#Introduction-and-Data-Structures)
    1.1. Overview of Pandas Library
    1.2. Installation
    1.3. Series
        1.3.1. Creation
        1.3.2. Attributes and Methods
    1.4. DataFrame
        1.4.1. Creation
        1.4.2. Attributes and Methods

2. [Data Input, Output, and Exploration](#Data-Input-Output-and-Exploration)
    2.1. Reading and Writing CSV Files
    2.2. Reading and Writing Excel Files
    2.3. Reading and Writing JSON Files
    2.4. Connecting to and Querying SQL Databases
    2.5. Previewing DataFrames (head, tail)
    2.6. Summary Information (info, describe)
    2.7. Checking and Converting Data Types (dtypes, astype)
    2.8. Handling Missing Values (isnull, dropna, fillna)

3. [Data Selection, Filtering, and Transformation](#Data-Selection-Filtering-and-Transformation)
    3.1. Selecting Columns ([], loc, iloc)
    3.2. Conditional Filtering
    3.3. Using Queries (query)
    3.4. Sampling Data (sample)
    3.5. Sorting Data (sort_values, sort_index)
    3.6. Merging DataFrames (merge, join)
    3.7. Concatenating DataFrames (concat, append)
    3.8. Grouping and Aggregating Data (groupby, agg)
    3.9. Pivot Tables (pivot_table)
    3.10. Removing Duplicates (duplicated, drop_duplicates)

4. [Data Conversion, Computation, and Visualization](#Data-Conversion-Computation-and-Visualization)
    4.1. Converting Data Types (astype)
    4.2. Applying Functions (apply, map, applymap)
    4.3. String Manipulation (str accessor)
    4.4. Handling Date and Time Data (pd.to_datetime, dt accessor)
    4.5. Basic Visualization (plot, hist, box, scatter)
    4.6. Visualization with Seaborn

5. [Advanced Topics and Practical Examples](#Advanced-Topics-and-Practical-Examples)
    5.1. MultiIndex DataFrames
    5.2. Reshaping DataFrames (melt, stack, unstack)
    5.3. Applying Custom Functions (applymap, transform)
    5.4. EDA Project Example
    5.5. Machine Learning Data Preprocessing Example
    5.6. Time Series Data Analysis Example
```

In [17]:
import pandas as pd

## 1. Introduction and Data Structures

### 1.1. Overview of Pandas Library
Pandas is a powerful and flexible open-source data analysis and manipulation library for Python. It provides data structures like Series and DataFrame, which are designed to make data manipulation and analysis fast and easy. Pandas is built on top of NumPy and is often used in conjunction with other libraries such as Matplotlib and Scikit-learn.

### 1.2. Installation
To install Pandas, use pip (Python's package installer) by running the following command in your terminal or command prompt. Inside a Jupyter Notebook cell, prefix the command with `!`:
`pip install pandas`

### 1.3. Series
#### 1.3.1. Creation
A Pandas Series is a one-dimensional array-like object that can hold various data types such as integers, floats, and strings. It is similar to a column in a table or a spreadsheet.

In [None]:
# Creating a Series from a list
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
series

In [25]:
# Creating a Series with custom index
data = [1, 2, None, 4, 'c']
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
series

a       1
b       2
c    None
d       4
e       c
dtype: object
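
A Series can also be built from a dictionary, in which case the keys become the index labels. The following is a minimal sketch; the `prices` data is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data: dictionary keys become the index labels
prices = pd.Series({'apple': 1.5, 'banana': 0.5, 'cherry': 3.0})

by_label = prices['banana']            # label-based lookup
by_position = prices.iloc[0]           # position-based lookup
subset = prices[['apple', 'cherry']]   # select several labels at once

print(by_label, by_position)
print(subset)
```

Using `iloc` for positional access keeps the intent explicit when the index labels are not integers.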

#### 1.3.2. Attributes and Methods
Pandas Series comes with a variety of attributes and methods that make data manipulation straightforward. Here are some of the commonly used attributes and methods:

| **Method/Attribute** | **Description**                                      | **Example Code**                                             |
|----------------------|------------------------------------------------------|--------------------------------------------------------------|
| `series.sum()`       | Calculates the sum of the Series.                    | `series.sum()`                                               |
| `series.mean()`      | Calculates the mean of the Series.                   | `series.mean()`                                              |
| `series.max()`       | Finds the maximum value in the Series.               | `series.max()`                                               |
| `series.isnull()`    | Checks for null values in the Series, returns a Boolean Series. | `series_with_nan.isnull()`                                   |
| `series.notnull()`   | Checks for non-null values in the Series, returns a Boolean Series. | `series_with_nan.notnull()`                                  |
| `series.fillna()`    | Fills null values with the specified value.          | `series_with_nan.fillna(0)`                                  |
| `series.apply()`     | Applies a function to each element of the Series.    | `series.apply(lambda x: x ** 2)`                             |
| `series.count()`     | Counts the non-null elements in the Series.          | `series.count()`                                             |
| `series.std()`       | Calculates the standard deviation of the Series.     | `series.std()`                                               |
| `series.median()`    | Calculates the median of the Series.                 | `series.median()`                                            |
| `series.quantile()`  | Calculates the specified quantile of the Series.     | `series.quantile(0.25)`                                      |
| `series.value_counts()` | Counts the occurrences of each value in the Series.  | `series.value_counts()`                                      |
| `series.sort_values()` | Sorts the Series by its values.                      | `series.sort_values()`                                       |
| `series.rank()`         | Ranks the values in the Series.                     | `series.rank()`                                              |
| `series.cumsum()`       | Calculates the cumulative sum of the Series.        | `series.cumsum()`                                            |
| `series.shift()`        | Shifts the values in the Series by the specified number of periods. | `series.shift(1)`                                            |
| `np.log(series)`        | Applies the NumPy log function to the Series.       | `np.log(series)`                                             |
| `series.reindex()`      | Reindexes the Series with the specified index, filling missing values with the specified value. | `series.reindex(['a', 'b', 'c', 'd', 'e', 'f'], fill_value=0)` |

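The table refers to `series_with_nan` and `np.log`, neither of which is defined in the cells shown here. The following is a minimal sketch of how they could be set up, assuming a small numeric Series with one missing value (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric Series with one missing value
series_with_nan = pd.Series([1.0, 2.0, np.nan, 4.0], index=['a', 'b', 'c', 'd'])

# Aggregations skip NaN by default
total = series_with_nan.sum()        # 1 + 2 + 4
average = series_with_nan.mean()     # sum divided by the 3 non-null values
non_null = series_with_nan.count()   # number of non-null values

# Replace the missing value, and apply a NumPy function element-wise
filled = series_with_nan.fillna(0)
logged = np.log(series_with_nan)     # NaN stays NaN

# Reindex with an extra label; fill_value is used only for labels
# that are new, not for NaN values already present in the data
reindexed = series_with_nan.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

print(total, average, non_null)
print(reindexed)
```

Note that `reindex(..., fill_value=0)` leaves the existing `NaN` at label `c` untouched; use `fillna()` to replace values that are already missing.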

In [31]:
# Checking for missing values
series.isnull()

a    False
b    False
c     True
d    False
e    False
dtype: bool

In [32]:
# Checking for non-missing values
series.notnull()

a     True
b     True
c    False
d     True
e     True
dtype: bool

In [33]:
# Filling missing values
series.fillna(9999)

a       1
b       2
c    9999
d       4
e       c
dtype: object

In [None]:
# Creating a Series with custom index
data = [10, 2, 7, 4, 1]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)

In [None]:
# Ranking
series.rank()

a    5.0
b    2.0
c    4.0
d    3.0
e    1.0
dtype: float64

In [41]:
# Sorting
series.sort_values()

e     1
b     2
d     4
c     7
a    10
dtype: int64

In [42]:
# Cumulative sum
series.cumsum()

a    10
b    12
c    19
d    23
e    24
dtype: int64

In [38]:
# Shifting values
series.shift(1)

a     NaN
b    10.0
c     2.0
d     7.0
e     4.0
dtype: float64

In [39]:
series.shift(-1)

a    2.0
b    7.0
c    4.0
d    1.0
e    NaN
dtype: float64

In [40]:
# Applying a function to each element
series.apply(lambda x: x ** 2)

a    100
b      4
c     49
d     16
e      1
dtype: int64

### 1.4. DataFrame

#### 1.4.1. Creation
A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
```python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['foo', 'bar', 'baz', 'qux', 'quux']
}
df = pd.DataFrame(data)
print("DataFrame from dictionary:\n", df)

# Creating a DataFrame from a list of dictionaries
data = [
    {'A': 1, 'B': 10, 'C': 'foo'},
    {'A': 2, 'B': 20, 'C': 'bar'},
    {'A': 3, 'B': 30, 'C': 'baz'}
]
df = pd.DataFrame(data)
print("DataFrame from list of dictionaries:\n", df)

# Creating a DataFrame from a list of lists
data = [
    [1, 10, 'foo'],
    [2, 20, 'bar'],
    [3, 30, 'baz']
]
columns = ['A', 'B', 'C']
df = pd.DataFrame(data, columns=columns)
print("DataFrame from list of lists:\n", df)
```

#### 1.4.2. Attributes and Methods
Pandas DataFrame comes with a variety of attributes and methods that make data manipulation straightforward. Here are some of the commonly used attributes and methods:

In [50]:
# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['foo', 'bar', 'baz', 'qux', 'quux']
}
df = pd.DataFrame(data)
df

|   | A | B | C |
|---|---|---|---|
| 0 | 1 | 10 | foo |
| 1 | 2 | 20 | bar |
| 2 | 3 | 30 | baz |
| 3 | 4 | 40 | qux |
| 4 | 5 | 50 | quux |


In [51]:
# Accessing DataFrame attributes
print("DataFrame shape:", df.shape)
print("DataFrame columns:", df.columns)
print("DataFrame index:", df.index)
print("DataFrame data types:\n", df.dtypes)

DataFrame shape: (5, 3)
DataFrame columns: Index(['A', 'B', 'C'], dtype='object')
DataFrame index: RangeIndex(start=0, stop=5, step=1)
DataFrame data types:
 A     int64
B     int64
C    object
dtype: object


In [52]:
# Viewing data
print("Head of the DataFrame:\n", df.head())
print("Tail of the DataFrame:\n", df.tail())

Head of the DataFrame:
    A   B     C
0  1  10   foo
1  2  20   bar
2  3  30   baz
3  4  40   qux
4  5  50  quux
Tail of the DataFrame:
    A   B     C
0  1  10   foo
1  2  20   bar
2  3  30   baz
3  4  40   qux
4  5  50  quux


In [54]:
# Basic statistics (only numeric columns)
print("DataFrame description:\n", df.describe(include=[int, float]))
print("Sum of each column:\n", df.sum(numeric_only=True))
print("Mean of each column:\n", df.mean(numeric_only=True))
print("Max of each column:\n", df.max(numeric_only=True))

DataFrame description:
               A          B
count  5.000000   5.000000
mean   3.000000  30.000000
std    1.581139  15.811388
min    1.000000  10.000000
25%    2.000000  20.000000
50%    3.000000  30.000000
75%    4.000000  40.000000
max    5.000000  50.000000
Sum of each column:
 A     15
B    150
dtype: int64
Mean of each column:
 A     3.0
B    30.0
dtype: float64
Max of each column:
 A     5
B    50
dtype: int64


In [53]:
df.mean()

TypeError: Could not convert ['foobarbazquxquux'] to numeric
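
The `TypeError` above arises because column `C` holds strings, which cannot be averaged. Two common ways around it are sketched below, using the same sample DataFrame: pass `numeric_only=True` to the reduction, or select the numeric columns first with `select_dtypes`.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': ['foo', 'bar', 'baz', 'qux', 'quux']
})

# Option 1: ask the reduction to skip non-numeric columns
means = df.mean(numeric_only=True)

# Option 2: select the numeric columns first, then reduce
numeric_df = df.select_dtypes(include='number')
means_selected = numeric_df.mean()

print(means)
print(list(numeric_df.columns))
```

The second option is handy when several reductions are applied in a row, since the numeric subset only has to be selected once.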

In [None]:
# Selecting data
print("Selecting column 'A':\n", df['A'])
print("Selecting multiple columns 'A' and 'B':\n", df[['A', 'B']])
print("Selecting rows 0 to 2:\n", df[0:3])

# Conditional selection
print("Rows where column 'A' > 2:\n", df[df['A'] > 2])

# Adding new columns
df['D'] = df['A'] + df['B']
print("DataFrame with new column 'D':\n", df)

# Dropping columns
df = df.drop('D', axis=1)
print("DataFrame after dropping column 'D':\n", df)

# Sorting
df_sorted = df.sort_values(by='B', ascending=False)
print("DataFrame sorted by column 'B' in descending order:\n", df_sorted)

# Handling missing values
df_with_nan = df.copy()
df_with_nan.loc[1, 'A'] = None
print("DataFrame with NaN:\n", df_with_nan)
df_filled = df_with_nan.fillna(0)
print("DataFrame with NaN filled with 0:\n", df_filled)

# Applying functions
df['A_squared'] = df['A'].apply(lambda x: x ** 2)
print("DataFrame with column 'A' squared:\n", df)


## 4. Data Conversion, Computation, and Visualization

### 4.1. Converting Data Types

In [14]:
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': ['10', '20', '30', '40']}
df = pd.DataFrame(data)

# Converting column B from string to integer
df['B'] = df['B'].astype(int)
df.dtypes

A    int64
B    int32
dtype: object
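
`astype(int)` raises a `ValueError` if any string in the column cannot be parsed as an integer. One more forgiving alternative is `pd.to_numeric` with `errors='coerce'`, sketched below; the `raw` column and its `'n/a'` entry are hypothetical.

```python
import pandas as pd

# Hypothetical column with one value that is not a valid number
raw = pd.Series(['10', '20', 'n/a', '40'])

# errors='coerce' turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

print(converted)
print(converted.isnull().sum())  # number of entries that failed to parse
```

Because `NaN` is a floating-point value, the result here has a float dtype rather than an integer one.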

### 4.2. Applying Functions
- Using `apply()`, `map()`, and `applymap()`

In [8]:
# Applying a function to a column using apply()
df['A_squared'] = df['A'].apply(lambda x: x ** 2)
df

|   | A | B | A_squared |
|---|---|---|---|
| 0 | 1 | 10 | 1 |
| 1 | 2 | 20 | 4 |
| 2 | 3 | 30 | 9 |
| 3 | 4 | 40 | 16 |


In [9]:
# Mapping values in a column using map()
df['B_mapped'] = df['B'].map({10: 'X', 20: 'Y', 30: 'Z', 40: 'W'})
df

|   | A | B | A_squared | B_mapped |
|---|---|---|---|---|
| 0 | 1 | 10 | 1 | X |
| 1 | 2 | 20 | 4 | Y |
| 2 | 3 | 30 | 9 | Z |
| 3 | 4 | 40 | 16 | W |


In [None]:
# Applying a function to all elements of the DataFrame using applymap()
# Note: applymap() is deprecated in pandas 2.1+; DataFrame.map() is the replacement
df_applied = df.applymap(lambda x: str(x) + '!')
df_applied

|   | A | B | A_squared | B_mapped |
|---|---|---|---|---|
| 0 | 1! | 10! | 1! | X! |
| 1 | 2! | 20! | 4! | Y! |
| 2 | 3! | 30! | 9! | Z! |
| 3 | 4! | 40! | 16! | W! |
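
To keep an element-wise example working across pandas versions, one option is to prefer `DataFrame.map` (available from pandas 2.1) and fall back to `applymap` on older releases. The following is a sketch of that approach; the helper name `add_bang` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

def add_bang(value):
    """Append '!' to the string form of a single element."""
    return str(value) + '!'

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas
if hasattr(pd.DataFrame, 'map'):
    df_applied = df.map(add_bang)
else:
    df_applied = df.applymap(add_bang)

print(df_applied)
```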


### 4.3. String Manipulation
- Using the `str` accessor for string operations

In [15]:
# Sample DataFrame with string data
data = {'C': ['apple', 'banana', 'cherry', 'date']}
df_str = pd.DataFrame(data)
df_str

|   | C |
|---|---|
| 0 | apple |
| 1 | banana |
| 2 | cherry |
| 3 | date |


In [11]:
# Converting to uppercase
df_str['C_upper'] = df_str['C'].str.upper()
df_str

|   | C | C_upper |
|---|---|---|
| 0 | apple | APPLE |
| 1 | banana | BANANA |
| 2 | cherry | CHERRY |
| 3 | date | DATE |


In [12]:
# Checking if the string contains a substring
df_str['C_has_a'] = df_str['C'].str.contains('a')
df_str

|   | C | C_upper | C_has_a |
|---|---|---|---|
| 0 | apple | APPLE | True |
| 1 | banana | BANANA | True |
| 2 | cherry | CHERRY | False |
| 3 | date | DATE | True |
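
The `str` accessor exposes many more vectorized string methods than the two shown above. A few of them are sketched below on the same sample data; the new column names are chosen only for illustration.

```python
import pandas as pd

df_str = pd.DataFrame({'C': ['apple', 'banana', 'cherry', 'date']})

# Length of each string
df_str['C_len'] = df_str['C'].str.len()

# Literal (non-regex) replacement of every 'a' with 'A'
df_str['C_replaced'] = df_str['C'].str.replace('a', 'A', regex=False)

# Boolean test on the start of each string
df_str['C_starts_b'] = df_str['C'].str.startswith('b')

print(df_str)
```

Passing `regex=False` to `str.replace` makes the pattern a plain substring, which avoids surprises when the text contains regular-expression metacharacters.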
