## **Python Pandas Tutorials - Part 1**

**1. Introduction to Pandas:**
- Pandas is a powerful open-source library for data manipulation and analysis in Python.
- It provides two main data structures: Series and DataFrame, which are built on top of NumPy arrays.

**2. Pandas Series:**
- A Pandas Series is a one-dimensional labeled array that can hold data of any type (integers, strings, etc.).
- It is similar to a NumPy array but has additional labels (index) for each element.
- Creating a Series: `pd.Series(data, index)`.
- Accessing Elements: Use the index label to access specific elements.
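
As a minimal sketch of the two bullets above, the following builds a Series with string labels and reads values back by label. The data and the names `prices`, `banana_price`, and `subset` are illustrative assumptions, not part of the tutorial.

```python
import pandas as pd

# Illustrative data: three values, each with an explicit index label.
prices = pd.Series([1.5, 0.25, 3.0], index=["apple", "banana", "cherry"])

# Access a single element by its index label.
banana_price = prices["banana"]

# Access several labels at once; the result is again a Series.
subset = prices[["apple", "cherry"]]

print(banana_price)
print(subset)
```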

**3. Pandas DataFrame:**
- A Pandas DataFrame is a two-dimensional labeled data structure, like a table or spreadsheet.
- It consists of rows and columns, where each column can hold different data types.
- Creating a DataFrame: Various methods include dictionaries, lists, NumPy arrays, CSV files, etc.
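
The creation routes listed above can be sketched as follows. The column names and values are made up for illustration; only the `pd.DataFrame` constructor itself comes from the notes.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names.
from_dict = pd.DataFrame({"name": ["Ann", "Ben"], "age": [31, 27]})

# From a list of rows: column names are passed explicitly.
from_rows = pd.DataFrame([["Ann", 31], ["Ben", 27]], columns=["name", "age"])

# From a NumPy array: every column shares the array's dtype.
from_array = pd.DataFrame(np.zeros((2, 3)), columns=["a", "b", "c"])

print(from_dict)
print(from_rows)
print(from_array.shape)
```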

**4. Basic DataFrame Operations:**
- Loading Data: Use functions like `pd.read_csv()`, `pd.read_excel()`, etc., to read data into a DataFrame.
- Viewing Data: `head()`, `tail()`, `sample()`, `shape`, and `info()` provide information about the DataFrame.
- Accessing Data: Use indexing and slicing to extract rows and columns from the DataFrame.
- Data Cleaning: Handle missing values, duplicate rows, and data type conversions.
- Filtering Data: Use Boolean indexing to filter rows based on conditions.
- Sorting Data: `sort_values(by=column)` sorts rows by the values in a column; `sort_index()` sorts rows by their index labels.
- Aggregating Data: Functions like `sum()`, `mean()`, `min()`, `max()`, etc., to compute summary statistics.
- Adding and Modifying Data: `insert()`, `assign()`, and arithmetic operations to add or modify columns.
- Merging DataFrames: `merge()`, `concat()`, and `join()` to combine data from different DataFrames.
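
Several of the operations in this list are not demonstrated in the notebook cells, so here is a small sketch of sorting, adding a column, aggregating, and merging. The `sales` and `managers` tables and all of their values are hypothetical.

```python
import pandas as pd

# Illustrative data; none of these names come from the tutorial.
sales = pd.DataFrame({
    "store": ["north", "south", "east"],
    "units": [12, 7, 20],
    "price": [2.0, 3.5, 1.5],
})

# Sorting: order rows by a column, largest first.
by_units = sales.sort_values(by="units", ascending=False)

# Adding a column: assign() returns a new DataFrame and leaves `sales` unchanged.
with_revenue = sales.assign(revenue=sales["units"] * sales["price"])

# Aggregating: reduce a column to a single summary value.
total_units = with_revenue["units"].sum()
mean_revenue = with_revenue["revenue"].mean()

# Merging: combine with a second table on a shared key column.
# how="left" keeps every store, even one without a matching manager.
managers = pd.DataFrame({"store": ["north", "east"], "manager": ["Kim", "Lee"]})
merged = with_revenue.merge(managers, on="store", how="left")

print(by_units)
print(total_units, mean_revenue)
print(merged)
```

Because the merge is a left join, the "south" row is kept and its `manager` value is missing (NaN).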

**5. Data Visualization with Pandas:**
- Pandas integrates with Matplotlib to provide basic data visualization capabilities.
- Use methods like `plot()`, `hist()`, and `plot.scatter()` to create simple plots directly from DataFrames.

**6. Exporting Data:**
- Save DataFrame to various file formats using functions like `to_csv()`, `to_excel()`, etc.
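
A minimal round-trip sketch for CSV export, kept in memory so it does not touch the file system. The frame and its labels are illustrative assumptions.

```python
import io

import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = frame.to_csv()

# Read it back; index_col=0 restores the row labels from the first column.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(csv_text)
print(restored)
```

If `index_col` is omitted, the unlabeled first column is read back as an ordinary column named `Unnamed: 0`, which is a common surprise when re-loading exported data.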

**7. Conclusion:**
- Pandas is a versatile library for data analysis, providing powerful tools to handle and manipulate data efficiently.
- Understanding the basics of Series, DataFrame, and common operations is essential for data exploration and analysis tasks.

*Note: These notes provide a concise overview of Part 1 of the Python Pandas Tutorials. For more detailed explanations and examples, refer to the actual tutorial content on GitHub or online Pandas documentation.*

In [2]:
import pandas as pd
import numpy as np

In [24]:
# Let's make a 6x5 NumPy array holding the integers 0 to 29
new_data = np.arange(0,30).reshape(6,5)
new_data

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29]])

pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) 

In [96]:
## Create Dataframe

df = pd.DataFrame(data=new_data, index=['Row1','Row2','Row3','Row4','Row5','Row6'], 
                  columns=['Column1','Column2','Column3','Column4','Column5'])

In [27]:
df

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24
Row6       25       26       27       28       29


In [31]:
# first 5 rows
df.head()

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24


In [32]:
# last 5 rows
df.tail()

      Column1  Column2  Column3  Column4  Column5
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24
Row6       25       26       27       28       29


In [29]:
# random 4 rows

df.sample(4)

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row5       20       21       22       23       24
Row6       25       26       27       28       29
Row3       10       11       12       13       14


In [33]:
type(df)

pandas.core.frame.DataFrame

In [34]:
# info() provides information about the DataFrame

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, Row1 to Row6
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Column1  6 non-null      int32
 1   Column2  6 non-null      int32
 2   Column3  6 non-null      int32
 3   Column4  6 non-null      int32
 4   Column5  6 non-null      int32
dtypes: int32(5)
memory usage: 168.0+ bytes


## describe()

In Pandas, `describe()` is a useful method used to generate descriptive statistics for a DataFrame or Series. It provides a summary of the central tendency, dispersion, and shape of the data. The `describe()` method can be applied to both numeric and non-numeric data.

For DataFrame:

- For numeric columns, it reports count, mean, standard deviation (`std`), minimum, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum.
- For non-numeric (string or categorical) columns, it reports count, the number of distinct values (`unique`), the most frequent value (`top`), and that value's frequency (`freq`). When a DataFrame mixes numeric and non-numeric columns, `describe()` summarizes only the numeric ones by default; pass `include='all'` to cover every column.

For Series:

- A numeric Series gets the same statistics as a numeric DataFrame column.
- A non-numeric Series gets count, `unique`, `top`, and `freq`.

Example for DataFrame:

```python

import pandas as pd

# Sample DataFrame
data = {
    'Age': [25, 30, 35, 28, 40],
    'Salary': [50000, 60000, 75000, 45000, 80000],
}

df = pd.DataFrame(data)

# Using describe() on DataFrame
description = df.describe()

print(description)
```

Output:
```
            Age        Salary
count   5.000000      5.000000
mean   31.600000  62000.000000
std     5.941380  15247.950682
min    25.000000  45000.000000
25%    28.000000  50000.000000
50%    30.000000  60000.000000
75%    35.000000  75000.000000
max    40.000000  80000.000000
```

In the example above, we created a simple DataFrame with 'Age' and 'Salary' columns. The `describe()` method provides statistical information, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for both columns.

Example for Series:

```python

import pandas as pd

# Sample Series
data = [25, 30, 35, 28, 40]

series = pd.Series(data)

# Using describe() on Series
description = series.describe()

print(description)
```

Output:
```
count     5.000000
mean     31.600000
std       5.941380
min      25.000000
25%      28.000000
50%      30.000000
75%      35.000000
max      40.000000
dtype: float64
```

In this case, the `describe()` method provides the same statistical information for the numeric Series 'series'.

The `describe()` method is a handy tool for quickly obtaining an overview of your data, especially when working with large datasets. It helps you gain insights into the distribution and characteristics of your data, aiding in initial data exploration and analysis.
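
Both examples above use numeric data only. The sketch below shows, under the assumption of a small made-up `people` table, how `describe()` behaves when a non-numeric column is present.

```python
import pandas as pd

# Illustrative data with one text column and one numeric column.
people = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
    "age": [25, 30, 35, 28],
})

# Default: only the numeric column ("age") is summarized.
numeric_only = people.describe()

# include="all" adds the text column; statistics that do not apply are NaN.
everything = people.describe(include="all")

# On a text Series the summary is count, unique, top, and freq.
city_summary = people["city"].describe()

print(numeric_only.columns.tolist())
print(everything.index.tolist())
print(city_summary)
```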

In [35]:
df.describe()

         Column1    Column2    Column3    Column4    Column5
count   6.000000   6.000000   6.000000   6.000000   6.000000
mean   12.500000  13.500000  14.500000  15.500000  16.500000
std     9.354143   9.354143   9.354143   9.354143   9.354143
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     6.250000   7.250000   8.250000   9.250000  10.250000
50%    12.500000  13.500000  14.500000  15.500000  16.500000
75%    18.750000  19.750000  20.750000  21.750000  22.750000
max    25.000000  26.000000  27.000000  28.000000  29.000000


##  `loc` and `iloc`

In Pandas, `loc` and `iloc` are the two main indexers for accessing data in a DataFrame. They are used with square brackets rather than called like methods, and they retrieve rows and columns by label (`loc`) or by integer position (`iloc`).

**`loc`:**

- `loc` is label-based: it selects data by row and column labels (index labels).
- It takes a row selector and an optional column selector separated by a comma: `loc[row_label, column_label]`. With a row label alone, it returns the whole row.
- Each selector can be a single label, a list of labels, a label slice (which includes its end label), or a Boolean mask.

**Example:**

```python

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
}

df = pd.DataFrame(data, index=['A', 'B', 'C'])

# Accessing data using loc
alice_data = df.loc['A']          # Get the data for row with label 'A'
bob_age = df.loc['B', 'Age']      # Get the 'Age' value for row with label 'B'

print("Data for row 'A':")
print(alice_data)

print("\nAge of 'Bob':")
print(bob_age)
```

**`iloc`:**

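- `iloc` is position-based: it selects data by integer position, counting from 0.
- It takes a row position and an optional column position separated by a comma: `iloc[row_index, column_index]`. With a row position alone, it returns the whole row.
- Each selector can be a single integer, a list of integers, or an integer slice (which excludes its end position, like ordinary Python slicing).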

**Example:**

```python

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
}

df = pd.DataFrame(data)

# Accessing data using iloc
alice_data = df.iloc[0]             # Get the data for the first row (index 0)
bob_age = df.iloc[1, 1]             # Get the 'Age' value for the second row (index 1)

print("Data for the first row:")
print(alice_data)

print("\nAge of the second row:")
print(bob_age)
```

In the `iloc` example, we used integer-based positions to retrieve specific rows and columns. The first row corresponds to index 0, and the second row corresponds to index 1.

Both `loc` and `iloc` are useful for extracting specific data from a DataFrame. Choose `loc` when you want to select by label and `iloc` when you want to select by integer position.
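
One difference that is easy to miss is how the two indexers treat slices. The sketch below reuses the Alice/Bob/Charlie data from the examples above; the names `df_demo`, `by_label`, and `by_position` are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["A", "B", "C"],
)

# loc slices by label and INCLUDES the end label: rows 'A' and 'B'.
by_label = df_demo.loc["A":"B", "Age"]

# iloc slices by position and EXCLUDES the end position: only row 0.
by_position = df_demo.iloc[0:1, 1]

print(by_label.tolist())     # [25, 30]
print(by_position.tolist())  # [25]
```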

In [36]:
# Indexing: loc selects by label, iloc selects by integer position.

## Three ways to select: column name (df[...]), row label (.loc), row/column position (.iloc)

df.head()


      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24


In [41]:
## by using column name
df['Column1']

Row1     0
Row2     5
Row3    10
Row4    15
Row5    20
Row6    25
Name: Column1, dtype: int32

In [37]:
## a single column selected by name is a Series
type(df['Column1'])

pandas.core.series.Series

In Pandas, a `Series` is a fundamental data structure that represents a one-dimensional labeled array. It can hold data of any type, including integers, floating-point numbers, strings, and more. The primary components of a Series are the data and the associated labels, known as the index.

In [39]:
df[['Column1','Column2','Column3']]

      Column1  Column2  Column3
Row1        0        1        2
Row2        5        6        7
Row3       10       11       12
Row4       15       16       17
Row5       20       21       22
Row6       25       26       27


In [40]:
##using row index name loc
df.loc[['Row3','Row4']]

      Column1  Column2  Column3  Column4  Column5
Row3       10       11       12       13       14
Row4       15       16       17       18       19


In [46]:
##using row index name loc

df.loc['Row1']

Column1    0
Column2    1
Column3    2
Column4    3
Column5    4
Name: Row1, dtype: int32

In [47]:
##using row index name loc

df.loc[['Row1']]

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4


In [48]:
df.loc[['Row1','Row4']]

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row4       15       16       17       18       19


In [49]:
df

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24
Row6       25       26       27       28       29


In [52]:
# using iloc to select rows at positions 3-4 and columns at positions 0-2:
#   15, 16, 17
#   20, 21, 22

df.iloc[3:5, 0:3]


      Column1  Column2  Column3
Row4       15       16       17
Row5       20       21       22


In [71]:
df.iloc[:,[0,3]]

      Column1  Column4
Row1        0        3
Row2        5        8
Row3       10       13
Row4       15       18
Row5       20       23
Row6       25       28


In [78]:
##convert dataframe into arrays
df.iloc[:,1:].values

array([[ 1,  2,  3,  4],
       [ 6,  7,  8,  9],
       [11, 12, 13, 14],
       [16, 17, 18, 19],
       [21, 22, 23, 24],
       [26, 27, 28, 29]])
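
The pandas documentation recommends `DataFrame.to_numpy()` over the `.values` attribute for this conversion. A minimal sketch, rebuilding the same 6x5 frame under the assumed name `df_demo`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(0, 30).reshape(6, 5),
    columns=["Column1", "Column2", "Column3", "Column4", "Column5"],
)

# to_numpy() returns the same data as .values for this all-integer frame.
arr = df_demo.iloc[:, 1:].to_numpy()

print(arr.shape)
print(np.array_equal(arr, df_demo.iloc[:, 1:].values))
```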

In [79]:
## Basic operations
df.isnull().sum()

Column1    0
Column2    0
Column3    0
Column4    0
Column5    0
dtype: int64

In [95]:
# create a new DataFrame containing a NaN value

df2 = pd.DataFrame(
    data=[[1, np.nan, 2], [1, 3, 4]],
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3"],
)

In [88]:
df2

      Column1  Column2  Column3
Row1        1      NaN        2
Row2        1      3.0        4


In [89]:
df2.isnull().sum()

Column1    0
Column2    1
Column3    0
dtype: int64

In [90]:
df2.isnull().sum()==0

Column1     True
Column2    False
Column3     True
dtype: bool
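
Once missing values have been located, they are usually either dropped or filled. The following sketch rebuilds the same two-row frame under the assumed name `df_nan` and shows three common options; which one is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame(
    [[1, np.nan, 2], [1, 3, 4]],
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3"],
)

# Option 1: drop every row that contains at least one NaN.
dropped = df_nan.dropna()

# Option 2: replace NaN with a fixed value (0 here is an arbitrary choice).
filled = df_nan.fillna(0)

# Option 3: replace NaN with the mean of its column (NaN is skipped by mean()).
mean_filled = df_nan.fillna(df_nan.mean())

print(dropped.index.tolist())
print(filled.loc["Row1", "Column2"])
print(mean_filled.loc["Row1", "Column2"])
```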

In [91]:
df2

      Column1  Column2  Column3
Row1        1      NaN        2
Row2        1      3.0        4


In [92]:
df2['Column3'].value_counts()

Column3
2    1
4    1
Name: count, dtype: int64

In [97]:
df

      Column1  Column2  Column3  Column4  Column5
Row1        0        1        2        3        4
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24
Row6       25       26       27       28       29


In [98]:
df['Column2'].unique()

array([ 1,  6, 11, 16, 21, 26])

In [99]:
df[df['Column2']>2]

      Column1  Column2  Column3  Column4  Column5
Row2        5        6        7        8        9
Row3       10       11       12       13       14
Row4       15       16       17       18       19
Row5       20       21       22       23       24
Row6       25       26       27       28       29
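
Boolean filters can also be combined. The sketch below rebuilds the same 6x5 frame under the assumed name `df_demo` and shows a two-condition filter and `isin()`; the thresholds are arbitrary.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    np.arange(0, 30).reshape(6, 5),
    index=["Row1", "Row2", "Row3", "Row4", "Row5", "Row6"],
    columns=["Column1", "Column2", "Column3", "Column4", "Column5"],
)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
middle = df_demo[(df_demo["Column2"] > 2) & (df_demo["Column5"] < 20)]

# isin() keeps rows whose value appears in the given list.
picked = df_demo[df_demo["Column1"].isin([0, 25])]

print(middle.index.tolist())  # ['Row2', 'Row3', 'Row4']
print(picked.index.tolist())  # ['Row1', 'Row6']
```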
