# Introduction to Pandas

___Ahmed Diab___ - `ML` and `Data Science` Engineer 💙  
Find me on: [LeetCode profile](https://leetcode.com/u/f9QcZm2R1P/)  -  [GitHub](https://github.com/ahmeddiab1234) - [LinkedIn](https://www.linkedin.com/in/ahmed-diab-3b0631245/) - [Kaggle](https://www.kaggle.com/codecaoch)  
I will be describing and solving the 15 problems from the ___Introduction to Pandas___ on LeetCode.

In this notebook, I will walk through the solution of each problem step by step,  
focusing on how to use the pandas library for data manipulation and analysis.  
Each problem will be presented with its description, input-output examples,  
and a clean and efficient solution using pandas.  

The problems are available on [LeetCode](https://leetcode.com/studyplan/introduction-to-pandas/).

### Problem 1
#### 2877. Create a DataFrame from List [URL](https://leetcode.com/problems/create-a-dataframe-from-list/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)


In this problem, we are tasked with creating a pandas DataFrame from a given 2D list called `student_data`, which contains pairs of student IDs and their corresponding ages. The goal is to transform this data into a tabular format with two columns: `student_id` and `age`.

To solve this:

Use the `pandas.DataFrame()` function to convert the list of lists into a structured table (DataFrame).  
The `columns` argument is used to specify the names of the columns, which are `student_id` and `age`.

<h4> The Code</h4>

In [2]:
import pandas as pd
from typing import List

def createDataframe(student_data: List[List[int]]) -> pd.DataFrame:
    df = pd.DataFrame(student_data, columns=['student_id', 'age'])
    return df
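As a quick check of the idea above, here is a minimal sketch. The sample values in `student_data` are made up for illustration and are not part of the problem statement.

```python
import pandas as pd

# Hypothetical sample input: each inner list is [student_id, age].
student_data = [[1, 15], [2, 11], [3, 11], [4, 20]]

# Each inner list becomes one row; `columns` names the two positions.
df = pd.DataFrame(student_data, columns=['student_id', 'age'])

print(df)
```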


### Problem 2
#### 2878. Get the Size of a DataFrame [URL](https://leetcode.com/problems/get-the-size-of-a-dataframe/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

The task is to calculate and display the number of rows and columns of a DataFrame, specifically for a DataFrame called `players`.  
The solution should return the dimensions of the DataFrame as a list in the form `[number of rows, number of columns]`.  

The `players` DataFrame contains various columns such as `player_id`, `name`, `age`, `position`, and potentially other columns. The goal is to determine the size of this DataFrame using pandas methods.

___Solution___: To solve this, we can use the `.shape` attribute of a pandas DataFrame, which returns a tuple containing the number of rows and columns. We can then convert this tuple to a list and return it.

<h4>The Code</h4>

In [None]:
import pandas as pd
from typing import List

def getDataframeSize(players: pd.DataFrame) -> List[int]:
    # `players` is already a DataFrame, so .shape can be read directly.
    return list(players.shape)
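The sketch below shows what `.shape` returns on a small DataFrame. The `players` data here is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data with 3 rows and 3 columns.
players = pd.DataFrame({
    'player_id': [846, 749, 155],
    'name': ['Mason', 'Riley', 'Bob'],
    'age': [21, 30, 28],
})

# .shape is a tuple (rows, columns); unpack it and build the list form.
n_rows, n_cols = players.shape
size = [n_rows, n_cols]

print(size)
```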


### Problem 3 
#### 2879. Display the First Three Rows [URL](https://leetcode.com/problems/display-the-first-three-rows/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___: The task is to display the first 3 rows of a DataFrame named `employees`. The DataFrame contains details about employees, including columns such as `employee_id`, `name`, `department`, and `salary`. The solution should return the top three rows of the DataFrame as output.

___Solution___: To achieve this, we can use the `.head()` method from pandas, which allows us to retrieve the first `n` rows of a DataFrame. By specifying `n=3`, we can get the top three rows.

<h4>The Code</h4>

In [None]:
import pandas as pd

def selectFirstRows(employees: pd.DataFrame) -> pd.DataFrame:
    return employees.head(3)
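One detail worth knowing: `.head(3)` does not fail when the DataFrame has fewer than three rows, it simply returns what is there. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: one frame with four rows, one with only two.
four_rows = pd.DataFrame({'employee_id': [3, 90, 9, 60],
                          'name': ['Bob', 'Alice', 'Tatiana', 'Annabelle']})
two_rows = four_rows.iloc[:2]

top_of_four = four_rows.head(3)  # first three rows
top_of_two = two_rows.head(3)    # only two rows exist, so both are returned

print(len(top_of_four), len(top_of_two))
```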


### Problem 4
#### 2880. Select Data [URL](https://leetcode.com/problems/select-data/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to select specific data from a DataFrame named `students`. Specifically, we need to extract the `name` and `age` of the student with `student_id = 101`. The result should be a subset of the DataFrame containing only the selected columns (`name`, `age`) for the matching student.

___Solution___:
To solve this, we can filter the rows of the DataFrame where `student_id` is equal to 101. Once filtered, we select only the `name` and `age` columns from the resulting subset.


<h4>The Code:</h4>

In [None]:
import pandas as pd

def selectData(students: pd.DataFrame) -> pd.DataFrame:
    df = students[students['student_id'] == 101][['name', 'age']]
    return df
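An equivalent way to write the same selection is `.loc`, which takes the row condition and the column list in one step. This is only an alternative sketch with made-up sample data, not a required change to the solution above.

```python
import pandas as pd

# Hypothetical sample data.
students = pd.DataFrame({
    'student_id': [101, 53, 128],
    'name': ['Ulysses', 'William', 'Henry'],
    'age': [13, 10, 6],
})

# Boolean mask: True only for the row where student_id equals 101.
mask = students['student_id'] == 101

# .loc[rows, columns] filters rows and picks columns together.
selected = students.loc[mask, ['name', 'age']]

print(selected)
```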


### Problem 5
#### 2881. Create a New Column [URL](https://leetcode.com/problems/create-a-new-column/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to modify a DataFrame named `employees` by adding a new column called `bonus`. The `bonus` column should contain values that are double the corresponding values in the `salary` column. The modified DataFrame will include the existing `name` and `salary` columns along with the newly added `bonus` column.

___Solution___:
To achieve this, we can directly create a new column in the DataFrame by assigning a calculated value to it. In this case, the `bonus` column is calculated as `salary * 2`.

<h4>The Code:</h4>

In [None]:
import pandas as pd

def createBonusColumn(employees: pd.DataFrame) -> pd.DataFrame:
    employees['bonus'] = employees['salary'] * 2
    return employees
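Note that the assignment above modifies the DataFrame that was passed in. If you would rather leave the input untouched, `DataFrame.assign` returns a new DataFrame with the extra column. A minimal sketch with invented data:

```python
import pandas as pd

# Hypothetical sample data.
employees = pd.DataFrame({'name': ['Piper', 'Grace'],
                          'salary': [4548, 28150]})

# assign() returns a copy with the new column; `employees` is unchanged.
with_bonus = employees.assign(bonus=employees['salary'] * 2)

print(with_bonus)
```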


### Problem 6
#### 2882. Drop Duplicate Rows [URL](https://leetcode.com/problems/drop-duplicate-rows/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to remove duplicate rows in a DataFrame named `customers` based on the `email` column. When duplicates are identified, only the first occurrence of each unique email should be retained in the DataFrame. The result should return the modified DataFrame with duplicates removed.

___Solution___:
To solve this, we can use the `drop_duplicates()` method in pandas. This method allows us to specify a column (or set of columns) to check for duplicates, while retaining only the first occurrence of each duplicate entry.

<h4>The Code</h4>

In [None]:
import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.drop_duplicates(subset='email')
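Which occurrence survives is controlled by the `keep` parameter; its default, `'first'`, is what the problem asks for. The sketch below compares it with `keep='last'` on made-up data.

```python
import pandas as pd

# Hypothetical sample data: customers 1 and 3 share an email address.
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Ella', 'David', 'Zachary'],
    'email': ['emily@example.com', 'michael@example.com', 'emily@example.com'],
})

# Default keep='first': the earliest row for each email is retained.
first_kept = customers.drop_duplicates(subset='email')

# keep='last': the latest row for each email is retained instead.
last_kept = customers.drop_duplicates(subset='email', keep='last')

print(first_kept['customer_id'].tolist(), last_kept['customer_id'].tolist())
```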


### Problem 7
#### 2883. Drop Missing Data [URL](https://leetcode.com/problems/drop-missing-data/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to remove rows from a DataFrame named `students` that contain missing values (`None` or `NaN`) in the `name` column. The resulting DataFrame should only include rows where the `name` column has valid, non-missing data.

___Solution___:
To solve this, we can use the `dropna()` method in pandas. This method allows us to specify a subset of columns to check for missing values and removes rows where the specified column(s) contain missing data.

<h4> The Code</h4>

In [None]:
import pandas as pd

def dropMissingData(students: pd.DataFrame) -> pd.DataFrame:
    return students.dropna(subset=['name'])
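The `subset` argument matters: without it, `dropna()` would also remove rows that are missing a value in any other column. A small sketch with invented data, where one row is missing `name` and another is missing `age`:

```python
import pandas as pd

# Hypothetical sample data.
students = pd.DataFrame({
    'student_id': [32, 217, 779],
    'name': ['Piper', None, 'Georgia'],
    'age': [5, 19, None],
})

# Only the `name` column is checked, so the row with a missing age stays.
cleaned = students.dropna(subset=['name'])

print(cleaned)
```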


### Problem 8 
#### 2884. Modify Columns [URL](https://leetcode.com/problems/modify-columns/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to modify the `salary` column in a DataFrame named `employees`. Each value in the `salary` column should be multiplied by 2 to reflect a pay rise. The updated DataFrame should return the modified values in the `salary` column while keeping the `name` column unchanged.

___Solution___:
To solve this, we can directly modify the `salary` column in the DataFrame by applying the `*=` operator to double each value.

<h4>The Code</h4>

In [None]:
import pandas as pd

def modifySalaryColumn(employees: pd.DataFrame) -> pd.DataFrame:
    employees['salary'] *= 2
    return employees


### Problem 9 
#### 2885. Rename Columns [URL](https://leetcode.com/problems/rename-columns/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to rename the columns of a DataFrame named `students` as follows:

`id` → `student_id`  
`first` → `first_name`  
`last` → `last_name`  
`age` → `age_in_years`  
The resulting DataFrame should reflect these updated column names, while the data in the rows remains unchanged.  

___Solution___:
To solve this, the column names of the DataFrame can be updated by assigning a new list of column names to the `columns` attribute.

<h4>The Code</h4>

In [None]:
import pandas as pd

def renameColumns(students: pd.DataFrame) -> pd.DataFrame:
    students.columns = ['student_id', 'first_name', 'last_name', 'age_in_years']
    return students
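Assigning a list to `columns` relies on the columns arriving in a known order. As an alternative sketch, `rename` with a mapping matches columns by name and returns a new DataFrame. The sample rows are made up.

```python
import pandas as pd

# Hypothetical sample data.
students = pd.DataFrame({
    'id': [1, 2],
    'first': ['Mason', 'Ava'],
    'last': ['King', 'Wright'],
    'age': [6, 7],
})

# Map each old name to its new name; column order does not matter here.
renamed = students.rename(columns={
    'id': 'student_id',
    'first': 'first_name',
    'last': 'last_name',
    'age': 'age_in_years',
})

print(list(renamed.columns))
```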


### Problem 10
#### 2886. Change Data Type [URL](https://leetcode.com/problems/change-data-type/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to modify the `students` DataFrame by correcting the data type of the `grade` column. Currently, the `grade` column is stored as floats, and it needs to be converted to integers.

___Solution___:
To solve this, the `astype()` method in pandas can be used to cast the `grade` column to the `int` data type.

<h4> The Code </h4>

In [None]:
import pandas as pd

def changeDatatype(students: pd.DataFrame) -> pd.DataFrame:
    students['grade'] = students['grade'].astype(int)
    return students
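Keep in mind that casting floats to `int` drops the fractional part rather than rounding. The sketch below uses invented grades to show this.

```python
import pandas as pd

# Hypothetical sample data: grades stored as floats.
students = pd.DataFrame({
    'student_id': [1, 2],
    'name': ['Ava', 'Kate'],
    'age': [6, 15],
    'grade': [73.0, 87.9],
})

# astype(int) truncates: 87.9 becomes 87, not 88.
students['grade'] = students['grade'].astype(int)

print(students['grade'].tolist())
print(students['grade'].dtype)
```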


### Problem 11
#### 2887. Fill Missing Data [URL](https://leetcode.com/problems/fill-missing-data/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to fill missing values (NaN) in the `quantity` column of the `products` DataFrame with 0.

___Solution___:
To solve this, the `fillna()` method in pandas is used to replace all missing values in the `quantity` column with `0`.

<h4> The Code</h4>

In [None]:
import pandas as pd

def fillMissingValues(products: pd.DataFrame) -> pd.DataFrame:
    products['quantity'] = products['quantity'].fillna(0)
    return products
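A minimal sketch of `fillna(0)` on made-up data. Only the `quantity` column is touched; the other columns are left as they are.

```python
import pandas as pd

# Hypothetical sample data with two missing quantities.
products = pd.DataFrame({
    'name': ['Wristwatch', 'WirelessEarbuds', 'GolfClubs'],
    'quantity': [float('nan'), float('nan'), 779.0],
    'price': [135, 821, 9319],
})

# Replace every NaN in `quantity` with 0.
products['quantity'] = products['quantity'].fillna(0)

print(products)
```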


### Problem 12
#### 2888. Reshape Data: Concatenate [URL](https://leetcode.com/problems/reshape-data-concatenate/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to vertically concatenate two DataFrames (`df1` and `df2`) into a single DataFrame.

<h4> The Code</h4>

In [None]:
import pandas as pd

def concatenateTables(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    df = pd.concat([df1, df2])
    return df
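By default `pd.concat` keeps each DataFrame's original row labels, so the stacked result can contain repeated index values. If a fresh 0..n-1 index is preferred, `ignore_index=True` does that. A sketch with invented data:

```python
import pandas as pd

# Hypothetical sample data with the same columns in both frames.
df1 = pd.DataFrame({'student_id': [1, 2], 'name': ['Mason', 'Ava']})
df2 = pd.DataFrame({'student_id': [3, 4], 'name': ['Leo', 'Alex']})

# Rows of df2 are appended below df1; original labels are kept.
stacked = pd.concat([df1, df2])

# Same rows, but the index is renumbered from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked.index.tolist(), renumbered.index.tolist())
```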


### Problem 13
#### 2889. Reshape Data: Pivot [URL](https://leetcode.com/problems/reshape-data-pivot/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:  
The task is to pivot the data so that each row represents temperatures for a specific month, and each city is a separate column.

<h4> The Code</h4>

In [None]:
import pandas as pd

def pivotTable(weather: pd.DataFrame) -> pd.DataFrame:
    pivoted = weather.pivot(index='month', columns='city', values='temperature')
    
    pivoted.columns.name = None 
    return pivoted


### Problem 14
#### 2890. Reshape Data: Melt [URL](https://leetcode.com/problems/reshape-data-melt/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

___Description___:
The task is to reshape the data so that each row represents sales data for a product in a specific quarter.

<h4> The Code</h4>

In [None]:
import pandas as pd

def meltTable(report: pd.DataFrame) -> pd.DataFrame:
    df = report.melt(id_vars=['product'], var_name='quarter', value_name='sales')
    return df


#### ___Note___
The difference between `join`, `melt`, and `pivot`

### 1. Join
- ___Purpose___: The `join` operation combines two or more DataFrames. By default it aligns rows on the index, which is why the example below sets `student_id` as the index on both DataFrames first.

- ___Use Case___: You would use `join` when you want to add additional columns from another DataFrame to an existing one based on a common identifier. This is similar to SQL joins (inner, outer, left, right).

- ___Example___: Suppose you have two DataFrames: one with student names and IDs, and another with student IDs and their grades. You would use `join` to combine these based on the `student_id`.

In [None]:
df1 = pd.DataFrame({'student_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'student_id': [1, 2, 3], 'grade': [85, 90, 88]})

result = df1.set_index('student_id').join(df2.set_index('student_id'))
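For comparison, the same result can be obtained with `merge`, which matches on a column directly instead of on the index. This is a sketch using the same two DataFrames as the example; `how='left'` mirrors the default behaviour of `join`.

```python
import pandas as pd

df1 = pd.DataFrame({'student_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'student_id': [1, 2, 3], 'grade': [85, 90, 88]})

# Index-based join, as in the example above.
joined = df1.set_index('student_id').join(df2.set_index('student_id'))

# Column-based merge; student_id stays a regular column.
merged = df1.merge(df2, on='student_id', how='left')

print(joined)
print(merged)
```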


### 2. Melt
- ___Purpose___: The `melt` operation is used to transform a DataFrame from a wide format to a long format. In this case, columns are unpivoted (collapsed into rows) to create a more normalized or tidy structure, where you have one column for values and one column for the variable names.

- ___Use Case___: You use `melt` when you need to convert a DataFrame with multiple columns for variables (e.g., sales per quarter) into a format where each row represents one variable value for a specific instance. It's typically used when you want to analyze the data by categories or time periods (like quarters or months).

- ___Example___: Suppose you have sales data for different products across four quarters, and you want to reshape it so that each row represents sales for a product in a specific quarter.

In [None]:
df = pd.DataFrame({
    'product': ['A', 'B'],
    'quarter_1': [100, 200],
    'quarter_2': [150, 250],
    'quarter_3': [200, 300],
    'quarter_4': [250, 350]
})

result = df.melt(id_vars=['product'], var_name='quarter', value_name='sales')


The result will have columns `product`, `quarter`, and `sales` in long format.

### 3. Pivot
- ___Purpose___: The `pivot` operation is used to reshape data in the opposite direction of `melt`. It’s used to convert a long-format DataFrame back into a wide format by creating a new column for each unique value in a categorical column (like `city`), with the values filled in the corresponding cells (like `temperature`).

- ___Use Case___: You would use `pivot` when you want to convert a long-format DataFrame back to a wide format, where each row corresponds to a specific index (e.g., months) and each column corresponds to a category (e.g., cities).

- ___Example___: Suppose you have temperature data for different cities and months, and you want to pivot it so that each city is a separate column, and each row represents temperatures for a specific month.

In [None]:
df = pd.DataFrame({
    'city': ['Jacksonville', 'Jacksonville', 'ElPaso', 'ElPaso'],
    'month': ['January', 'February', 'January', 'February'],
    'temperature': [13, 23, 20, 6]
})

result = df.pivot(index='month', columns='city', values='temperature')


The result will have `month` as the index, with columns for each city (`ElPaso`, `Jacksonville`), and the values filled with temperature data.
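To make the "opposite direction" relationship concrete, here is a small round-trip sketch: a wide table is melted to long format and then pivoted back. The data is a shortened, made-up version of the sales example.

```python
import pandas as pd

# Hypothetical wide-format data.
wide = pd.DataFrame({
    'product': ['A', 'B'],
    'quarter_1': [100, 200],
    'quarter_2': [150, 250],
})

# Wide -> long: one row per (product, quarter) pair.
long = wide.melt(id_vars=['product'], var_name='quarter', value_name='sales')

# Long -> wide: one column per quarter again.
back = long.pivot(index='product', columns='quarter', values='sales')
back.columns.name = None   # drop the leftover 'quarter' label on the columns
back = back.reset_index()  # turn `product` back into a regular column

print(long)
print(back)
```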


| Operation  | Purpose                                     | When to Use                                                      | Example                                                                 |
|------------|---------------------------------------------|------------------------------------------------------------------|-------------------------------------------------------------------------|
| **Join**   | Combines two DataFrames on a common column. | When you need to add columns from one DataFrame to another based on a key column. | Combining customer information with order details using a common `customer_id`. |
| **Melt**   | Converts wide-format data to long-format.   | When you need to "unpivot" multiple columns into a single column for analysis by categories. | Converting sales data from multiple quarters into a long format.       |
| **Pivot**  | Converts long-format data to wide-format.  | When you need to "pivot" a DataFrame back to a wide format, with columns for categories and rows for instances. | Reshaping weather data to show each city as a separate column with temperature values. |


### Problem 15 
#### 2891. Method Chaining [URL](https://leetcode.com/problems/method-chaining/description/?envType=study-plan-v2&envId=introduction-to-pandas&lang=pythondata)

You are given a DataFrame `animals` containing information about animals with the following columns:

- `name`: The name of the animal.
- `species`: The species of the animal.
- `age`: The age of the animal.
- `weight`: The weight of the animal.

The task is to list the names of animals that weigh strictly more than 100 kilograms, sorted by their weight in descending order.


To solve this problem, we use method chaining in Pandas, where multiple operations are performed on a DataFrame in a single line of code.

1. ___Filter___: We first filter the animals that weigh more than 100 kilograms using `animals[animals['weight'] > 100]`.
2. ___Sort___: Then, we sort the filtered DataFrame by the `weight` column in descending order using `.sort_values(by='weight', ascending=False)`.
3. ___Select `name` column___: Finally, we extract and return only the `name` column of the sorted DataFrame with `[['name']]`.

In [None]:
import pandas as pd

def findHeavyAnimals(animals: pd.DataFrame) -> pd.DataFrame:
    return animals[animals['weight'] > 100].sort_values(by='weight', ascending=False)[['name']]
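The same chain can be wrapped in parentheses and split across lines, one step per line, which some readers find easier to follow. The sketch below does that on made-up sample data.

```python
import pandas as pd

# Hypothetical sample data.
animals = pd.DataFrame({
    'name': ['Tatiana', 'Khaled', 'Alex', 'Jonathan', 'Stefan', 'Tommy'],
    'species': ['Snake', 'Giraffe', 'Leopard', 'Monkey', 'Bear', 'Panda'],
    'age': [98, 50, 6, 45, 100, 26],
    'weight': [464, 41, 328, 463, 50, 349],
})

heavy = (
    animals[animals['weight'] > 100]            # 1. filter
    .sort_values(by='weight', ascending=False)  # 2. sort, heaviest first
    [['name']]                                  # 3. keep only the name column
)

print(heavy)
```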


<h2>That's everything ❤</h2>