# Day 6: Introduction to Pandas
---

## Objectives:
- Get familiar with the basics of Pandas for data manipulation.
- Understand how to load data into Pandas DataFrames and perform basic operations.

---

## Topics to Cover:
### 1. Introduction to Pandas library and DataFrames:
- #### What is Pandas?
Pandas is an open-source data manipulation and analysis library built on top of NumPy. It provides powerful data structures like `DataFrame` and `Series` to manage and analyze data easily.
- #### DataFrame and Series concepts:
`DataFrame`: A two-dimensional, labeled data structure (a table of rows and columns) in which each column can hold a different data type (strings, integers, floats, etc.).
`Series`: A one-dimensional labeled array, often used to represent a single column of a DataFrame.
- #### Why use Pandas for data analysis?
Pandas simplifies data manipulation and analysis: it reads and writes data in many formats (CSV, Excel, JSON) and provides built-in methods for inspecting, cleaning, and transforming data.
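As a minimal sketch of the two structures described above, the snippet below builds a `Series` and a small `DataFrame` by hand. The names `ages` and `people` and the values are illustrative only.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([25, 30, 35], name='Age')

# A DataFrame is a table whose columns share one row index.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

print(type(people['Age']))   # selecting one column gives a Series
print(people.shape)          # (number of rows, number of columns)
print(people.dtypes)         # one dtype per column
```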

---

### 2. Loading data into Pandas:
- #### `pd.read_csv()`: Loading data from CSV files
This function allows you to read data from a CSV file and load it into a Pandas DataFrame.

Example:

In [None]:
import pandas as pd
df = pd.read_csv('path_to_your_file.csv')

- #### `pd.read_excel()`: Loading data from Excel files
Similarly, you can load data from Excel files with this function.

Example:

In [None]:
df = pd.read_excel('path_to_your_file.xlsx')

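The paths above are placeholders, so those cells only run once a real file exists. As a self-contained sketch, `pd.read_csv()` also accepts a file-like object, which lets you try it on an in-memory string. The CSV text and the names `csv_text`, `sample_df`, and `names_only` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Name,Age,Salary
Alice,25,50000
Bob,30,60000
"""

# read_csv accepts a path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are loaded.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=['Name'])

print(sample_df.shape)
print(list(names_only.columns))
```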
---

### 3. Basic operations on DataFrames:

In [1]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR']
}

df = pd.DataFrame(data)

- #### `head()`: View the first few rows of the dataset

This method returns the first rows of your dataset (five by default), which is useful for checking that the data loaded correctly.

In [2]:
df.head()

|   | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | Alice | 25 | 50000 | HR |
| 1 | Bob | 30 | 60000 | IT |
| 2 | Charlie | 35 | 70000 | Finance |
| 3 | David | 40 | 80000 | IT |
| 4 | Eva | 45 | 90000 | HR |

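`head()` also takes a row count, and `tail()` is its counterpart for the end of the table. The sketch below uses a cut-down copy of the sample data so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
})

first_two = df.head(2)   # first two rows
last_two = df.tail(2)    # last two rows

print(first_two)
print(last_two)
```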

- #### `info()`: Get metadata about the DataFrame
This method prints an overview of the DataFrame: the index range, column names, data types, the number of non-null entries in each column, and approximate memory usage.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   Salary      5 non-null      int64 
 3   Department  5 non-null      object
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes


- #### `describe()`: Summary statistics of numerical columns
This method provides basic statistics (like mean, standard deviation, min, max) for numerical columns in your DataFrame.

In [4]:
df.describe()

|   | Age | Salary |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 35.0 | 70000.0 |
| std | 7.905694 | 15811.388301 |
| min | 25.0 | 50000.0 |
| 25% | 30.0 | 60000.0 |
| 50% | 35.0 | 70000.0 |
| 75% | 40.0 | 80000.0 |
| max | 45.0 | 90000.0 |

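The `std` row is the sample standard deviation (divisor `n - 1`), which is why `Age` shows 7.905694 rather than the population value. A small sketch that recomputes it by hand for the `Age` values from the sample data; `manual_std` is an illustrative name.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 40, 45], name='Age')
summary = ages.describe()

# Sample standard deviation: squared deviations divided by (n - 1).
squared_dev = (ages - ages.mean()) ** 2
manual_std = (squared_dev.sum() / (len(ages) - 1)) ** 0.5

print(summary['mean'], summary['std'], manual_std)
```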

---

### 4. Accessing rows and columns:
- #### Accessing columns directly:
You can access a column by using its name as an attribute or as a key.

In [5]:
df['Name']

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object

In [6]:
df.Name

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object

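Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses two hypothetical column names (`Annual Salary` and `count`) to show where bracket access is the safer choice, and how a list of names selects several columns at once.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Annual Salary': [50000, 60000],   # hypothetical name containing a space
    'count': [1, 2],                   # hypothetical name that matches a method
})

salary = df['Annual Salary']     # bracket access handles the space
counts = df['count']             # the column
method = df.count                # the DataFrame.count method, not the column
subset = df[['Name', 'count']]   # a list of names returns a DataFrame
```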
- #### Using `loc[]` for label-based indexing:
`loc[]` selects rows and columns by label: row index labels and column names.

In [7]:
df.loc[0]

Name          Alice
Age              25
Salary        50000
Department       HR
Name: 0, dtype: object

In [8]:
df.loc[:, 'Age']

0    25
1    30
2    35
3    40
4    45
Name: Age, dtype: int64

- #### Using `iloc[]` for position-based indexing:
`iloc[]` selects by integer position (counting from 0) rather than by label.

In [9]:
df.iloc[0]

Name          Alice
Age              25
Salary        50000
Department       HR
Name: 0, dtype: object

In [10]:
df.iloc[:, 0]

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: Name, dtype: object

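With the default `RangeIndex`, `loc[0]` and `iloc[0]` return the same row, which hides the difference between them. The sketch below uses hypothetical string labels (`'a'`, `'b'`, `'c'`) to make it visible, including the fact that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['a', 'b', 'c'],   # hypothetical string labels
)

by_label = df.loc['b', 'Age']    # row labeled 'b'
by_position = df.iloc[1, 0]      # second row, first column
label_slice = df.loc['a':'b']    # includes the row labeled 'b'
position_slice = df.iloc[0:1]    # excludes position 1
```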
---

### 5. Adding and removing columns:

In [11]:
new_df = pd.DataFrame(['John', 'Hob', 'Michael', 'Alex', 'Vincent'], columns=['Last_name'])
df = pd.concat([df, new_df], axis=1)

- #### Adding a new column:
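You can add a new column by performing operations on existing columns. Here the two name columns are concatenated as strings, so the result has no space between first and last name.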

In [12]:
df['Full_name'] = df['Name'] + df['Last_name']
df

|   | Name | Age | Salary | Department | Last_name | Full_name |
|---|---|---|---|---|---|---|
| 0 | Alice | 25 | 50000 | HR | John | AliceJohn |
| 1 | Bob | 30 | 60000 | IT | Hob | BobHob |
| 2 | Charlie | 35 | 70000 | Finance | Michael | CharlieMichael |
| 3 | David | 40 | 80000 | IT | Alex | DavidAlex |
| 4 | Eva | 45 | 90000 | HR | Vincent | EvaVincent |

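If a readable full name is wanted, one option is to concatenate a separator string between the two columns. The sketch below shows that variant on a two-row copy of the data, along with a numeric derived column; `Monthly_salary` is a hypothetical column added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Last_name': ['John', 'Hob'],
    'Salary': [50000, 60000],
})

# String columns concatenate element-wise; ' ' adds the separator.
df['Full_name'] = df['Name'] + ' ' + df['Last_name']

# Arithmetic on a numeric column is also element-wise.
df['Monthly_salary'] = df['Salary'] / 12   # hypothetical derived column

print(df)
```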

- #### Removing columns using `drop()`:
`drop()` can be used to remove one or more columns from a DataFrame.

In [13]:
df.drop(columns=['Full_name'], inplace=True)
df

|   | Name | Age | Salary | Department | Last_name |
|---|---|---|---|---|---|
| 0 | Alice | 25 | 50000 | HR | John |
| 1 | Bob | 30 | 60000 | IT | Hob |
| 2 | Charlie | 35 | 70000 | Finance | Michael |
| 3 | David | 40 | 80000 | IT | Alex |
| 4 | Eva | 45 | 90000 | HR | Vincent |

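Without `inplace=True`, `drop()` leaves the original DataFrame untouched and returns a new one, so the result has to be assigned to a name. A short sketch, using the illustrative name `trimmed` and dropping two columns at once:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Salary': [50000, 60000],
})

# drop() returns a new DataFrame; df itself keeps all three columns.
trimmed = df.drop(columns=['Age', 'Salary'])

print(list(trimmed.columns))
print(list(df.columns))
```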

---

### 6. DataFrame slicing, filtering, and sorting:
- #### Slicing rows/columns:
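You can slice specific rows or columns from a DataFrame. The example below takes the rows at positions 0 through 4 (the stop value 5 is excluded) and all columns, which for this five-row DataFrame is the whole table.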

In [14]:
sliced_df = df.iloc[0:5, :]
sliced_df

|   | Name | Age | Salary | Department | Last_name |
|---|---|---|---|---|---|
| 0 | Alice | 25 | 50000 | HR | John |
| 1 | Bob | 30 | 60000 | IT | Hob |
| 2 | Charlie | 35 | 70000 | Finance | Michael |
| 3 | David | 40 | 80000 | IT | Alex |
| 4 | Eva | 45 | 90000 | HR | Vincent |

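A narrower slice makes the mechanics easier to see. The sketch below takes a sub-block of the sample data by position and by label; the names `by_position` and `by_label` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# Positions 1 and 2 (stop excluded), first two columns.
by_position = df.iloc[1:3, 0:2]

# Labels 1 through 3 (stop included), two named columns.
by_label = df.loc[1:3, ['Name', 'Salary']]

print(by_position)
print(by_label)
```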

- #### Filtering DataFrames using conditions:
A condition on a column produces a boolean Series; passing it inside the brackets keeps only the rows where it is true.

In [16]:
filtered_df = df[df['Salary'] > 75000]
filtered_df

|   | Name | Age | Salary | Department | Last_name |
|---|---|---|---|---|---|
| 3 | David | 40 | 80000 | IT | Alex |
| 4 | Eva | 45 | 90000 | HR | Vincent |

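Conditions can be combined with `&` (and) and `|` (or), as long as each one is wrapped in parentheses. `isin()` tests membership in a list. The thresholds and department lists below are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
})

# Both conditions must hold; the parentheses are required.
it_above_55k = df[(df['Salary'] > 55000) & (df['Department'] == 'IT')]

# Keep rows whose department is in the given list.
hr_or_finance = df[df['Department'].isin(['HR', 'Finance'])]

print(it_above_55k)
print(hr_or_finance)
```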

- #### Sorting DataFrames using `sort_values()`:
You can sort the DataFrame by specific columns.

In [19]:
sorted_df = df.sort_values(by='Salary', ascending=False)
sorted_df

|   | Name | Age | Salary | Department | Last_name |
|---|---|---|---|---|---|
| 4 | Eva | 45 | 90000 | HR | Vincent |
| 3 | David | 40 | 80000 | IT | Alex |
| 2 | Charlie | 35 | 70000 | Finance | Michael |
| 1 | Bob | 30 | 60000 | IT | Hob |
| 0 | Alice | 25 | 50000 | HR | John |

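`sort_values()` also accepts a list of columns, with a matching list of directions. As a sketch, the snippet below sorts the sample data by department alphabetically and then by salary from highest to lowest within each department.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Salary': [50000, 60000, 70000, 80000, 90000],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR'],
})

# Department ascending, then Salary descending inside each department.
by_dept_then_salary = df.sort_values(
    by=['Department', 'Salary'],
    ascending=[True, False],
)

print(by_dept_then_salary)
```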

---

## Exercises:
### 1. Load a dataset:
Load a dataset from a CSV file (for example, the Titanic dataset) and inspect the first few rows using `head()`.

In [35]:
import pandas as pd
df = pd.read_csv("Day_6_bmi_dataset.csv")
df.head()

|   | Gender | Height | Weight | Index |
|---|---|---|---|---|
| 0 | Male | 161 | 89 | 4 |
| 1 | Male | 179 | 127 | 4 |
| 2 | Male | 172 | 139 | 5 |
| 3 | Male | 153 | 104 | 5 |
| 4 | Male | 165 | 68 | 2 |


#### Explanation:
This reads the BMI dataset (`Day_6_bmi_dataset.csv`) from a CSV file into a Pandas DataFrame and displays the first 5 rows using the `head()` method.

### 2. Perform basic column-wise operations:
- Add a new column by performing an operation on existing columns (for example, calculating `BMI` from height and weight).

In [36]:
df['BMI'] = df['Weight'] / (df['Height'] / 100) ** 2
df.head()

|   | Gender | Height | Weight | Index | BMI |
|---|---|---|---|---|---|
| 0 | Male | 161 | 89 | 4 | 34.335095 |
| 1 | Male | 179 | 127 | 4 | 39.636715 |
| 2 | Male | 172 | 139 | 5 | 46.984857 |
| 3 | Male | 153 | 104 | 5 | 44.427357 |
| 4 | Male | 165 | 68 | 2 | 24.977043 |


#### Explanation:
This adds a new column called `BMI` by dividing the weight by the square of the height in meters. The code divides `Height` by 100 first, which treats the stored heights as centimeters.
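As a quick check of the arithmetic, the sketch below repeats the calculation on two rows copied from the output above, assuming `Height` is in centimeters and `Weight` in kilograms (the dataset itself does not state its units). The name `sample` is illustrative.

```python
import pandas as pd

# Two rows copied from the head() output; units are an assumption.
sample = pd.DataFrame({
    'Height': [161, 165],   # assumed centimeters
    'Weight': [89, 68],     # assumed kilograms
})

# BMI = weight / height_in_meters ** 2; ** binds tighter than /.
sample['BMI'] = sample['Weight'] / (sample['Height'] / 100) ** 2

print(sample['BMI'].round(2).tolist())
```

For the first row this is 89 / 1.61² = 89 / 2.5921, which matches the 34.335095 shown in the table.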

- Remove an unnecessary column (e.g., drop the `Gender` column).

In [37]:
df = df.drop(columns=['Gender'])
df.head()

|   | Height | Weight | Index | BMI |
|---|---|---|---|---|
| 0 | 161 | 89 | 4 | 34.335095 |
| 1 | 179 | 127 | 4 | 39.636715 |
| 2 | 172 | 139 | 5 | 46.984857 |
| 3 | 153 | 104 | 5 | 44.427357 |
| 4 | 165 | 68 | 2 | 24.977043 |


#### Explanation:
This removes the `Gender` column using the `drop()` method and assigns the result back to `df`.

### 3. Sort the dataset:
- Sort the dataset based on a specific column (for example, sort by the `Index` column in descending order).

In [39]:
sorted_df = df.sort_values(by='Index', ascending=False)
sorted_df.head()

|   | Height | Weight | Index | BMI |
|---|---|---|---|---|
| 200 | 186 | 146 | 5 | 42.201411 |
| 229 | 151 | 154 | 5 | 67.540897 |
| 188 | 142 | 91 | 5 | 45.129935 |
| 191 | 168 | 143 | 5 | 50.6661 |
| 196 | 144 | 88 | 5 | 42.438272 |


#### Explanation:
This sorts the DataFrame by the `Index` column in descending order and displays the first 5 rows of the sorted DataFrame.
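Sorting keeps each row's original index label, which is why the output starts at labels such as 200 and 229. If a fresh 0-based index is preferred, `reset_index(drop=True)` can be chained after the sort. The sketch below uses three hypothetical rows rather than the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the BMI dataset.
sample = pd.DataFrame({
    'Height': [165, 186, 151],
    'Weight': [68, 146, 154],
    'Index': [2, 5, 5],
})

sorted_sample = sample.sort_values(by='Index', ascending=False)

# drop=True discards the old labels instead of keeping them as a column.
renumbered = sorted_sample.reset_index(drop=True)

print(sorted_sample.index.tolist())
print(renumbered.index.tolist())
```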

---

## Conclusion:
On Day 6, we explored the fundamentals of the Pandas library, a powerful tool for data manipulation and analysis in Python. We learned how to load data into DataFrames, perform basic operations such as inspecting datasets, accessing rows and columns, and adding or removing data. Additionally, we covered methods for filtering, slicing, and sorting DataFrames. With these foundational skills, you can efficiently manage and analyze large datasets, setting the stage for more advanced data analysis techniques in the future.

***