# Pandas Tutorial for Beginners and Advanced Level

*Definition*: Pandas is an open-source Python library used for data manipulation, analysis, and cleaning. It provides fast, flexible, and expressive data structures like ```DataFrame``` and ```Series``` to work with structured data efficiently.

### Key Features:
* Handles structured data (CSV, Excel, SQL, JSON, etc.).
* Powerful data operations (filtering, grouping, merging, reshaping).
* Built-in handling of missing data (NaN values).
* Supports time-series analysis and multi-indexing.
* Integrates with NumPy, Matplotlib, and Scikit-learn for data science tasks.

### Usage:

```import pandas as pd```

### What kind of data does pandas handle?
Pandas can handle a wide variety of structured and semi-structured data sources and formats, including:
- CSV
- Excel
- SQL
- HTML
- XML
- JSON
- Time series data
- Textual data

Pandas primarily uses two data structures:

* ```Series```: A one-dimensional array-like object (e.g., a single column of data).

* ```DataFrame```: A two-dimensional table with rows and columns (like a spreadsheet).
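
To make the two structures concrete, here is a minimal sketch. The variable names and values are made up for illustration and are not part of the tutorial's dataset.

```python
import pandas as pd

# A Series: one-dimensional, labelled values (hypothetical ages)
ages = pd.Series([30.0, 29.0, 28.0], name="age")

# A DataFrame: a table whose columns are Series sharing one index
students = pd.DataFrame(
    {
        "dept": ["Political Science", "Anesthesia", "Public Administration"],
        "age": [30.0, 30.0, 29.0],
    }
)

print(ages.ndim)                       # 1
print(type(students["age"]).__name__)  # Series: a single column is a Series
print(students.shape)                  # (3, 2): rows, columns
```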



### How to Read and Write Tabular Data?
* Pandas provides functions to read and write data in various formats.
* To read data, we use the top-level functions ``` pd.read_*("path/to/dataset/data.*") ```, which return a ```DataFrame```.
* To write data, we call the matching method on the DataFrame itself: ```df.to_*("path/to/dataset/data.*") ```
  
  -- Note: * represents the file format. For example, if the data is ```.csv``` we use ``` pd.read_csv("path/to/dataset/data.csv") ```, if it is Excel data (```.xlsx```) we use ``` pd.read_excel("path/to/dataset/data.xlsx") ```, and so on.

#### Read data example

In [None]:
import pandas as pd
# Read CSV file
df = pd.read_csv('new_data.csv')
# Read Excel file
df = pd.read_excel('new_data.xlsx')
# Read JSON data
df = pd.read_json('data.json')

##### To read data from an SQL table, we first create a connection to the database.

In [None]:
import pandas as pd
from sqlalchemy import create_engine

# Define database connection (Replace with your details)
db_user = "root"
db_password = ""
db_host = "localhost"  # Or your database server
db_name = "my_db"

In [7]:
# Create engine (the mysqlconnector driver requires the mysql-connector-python package)
engine = create_engine(f"mysql+mysqlconnector://{db_user}:{db_password}@{db_host}/{db_name}")

In [8]:
# Read data from a table into Pandas DataFrame
table_name = "users"

In [9]:
df = pd.read_sql(f"SELECT * FROM {table_name}", engine)

In [None]:
df.head()
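
The same pattern can be tried without a database server. The sketch below swaps MySQL for an in-memory SQLite database from Python's standard library; the table contents are made up for illustration. It also passes the filter value through `params` instead of formatting it into the SQL string, which is generally the safer habit.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Abebe", 30), (2, "Sara", 25), (3, "Musa", 41)],  # hypothetical rows
)
conn.commit()

# The value 28 is bound to the ? placeholder by the database driver
adults = pd.read_sql(
    "SELECT * FROM users WHERE age > ? ORDER BY id", conn, params=(28,)
)
print(adults)
conn.close()
```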

#### Write data example

In [None]:
# Write to CSV
df.to_csv('output.csv', index=False)
# Write to Excel
df.to_excel('output.xlsx', index=False)
# Write to SQL (if_exists='replace' drops and recreates the table if it already exists)
df.to_sql('table_name', engine, if_exists='replace')
# Write to JSON
df.to_json('output.json')
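
To see what `index=False` changes, here is a small round trip through CSV text held in memory. It is a sketch with made-up rows; `df_demo` and the other names are not part of the tutorial's dataset.

```python
import io
import pandas as pd

# Hypothetical two-row table
df_demo = pd.DataFrame({"stu_id": ["R/0001/06", "R/0002/06"], "age": [30.0, 29.0]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_with_index = df_demo.to_csv()
csv_no_index = df_demo.to_csv(index=False)

back_with_index = pd.read_csv(io.StringIO(csv_with_index))
back_no_index = pd.read_csv(io.StringIO(csv_no_index))

# The saved index comes back as an extra, unnamed column
print(list(back_with_index.columns))  # ['Unnamed: 0', 'stu_id', 'age']
print(list(back_no_index.columns))    # ['stu_id', 'age']
```

If the index carries no information, writing with `index=False` keeps the file free of that extra column.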

In [20]:
new_data = pd.read_excel('../new_data.xlsx')

In [21]:
df = new_data.drop(['phone','first_name','middle_name'], axis=1)

In [24]:
df.to_csv('data.csv', index=False)

In [25]:
df.head()

Unnamed: 0,stu_id,gender,not_d,dept,region,marital_status,age,g_12,college
0,R/2791/06,Male,6,Political Science,Afar,Single,30.0,335,Social Science and Humanities
1,R/2253/06,Male,4,Anesthesia,Afar,Single,30.0,343,Medicine
2,R/1737/06,Male,1,Public Administration,Afar,Single,29.0,435,Business and Economics
3,R/0268/06,Male,2,Construction Engineering,Afar,Single,28.0,385,Institute of Technology
4,R/0400/06,Male,2,Construction Engineering,Afar,Single,28.0,371,Institute of Technology


### How to Select a Subset of a DataFrame?

In [28]:
import pandas as pd
data = pd.read_csv('../data/data.csv')

In [29]:
data.head()

Unnamed: 0,stu_id,gender,not_d,dept,region,marital_status,age,g_12,college
0,R/2791/06,Male,6,Political Science,Afar,Single,30.0,335,Social Science and Humanities
1,R/2253/06,Male,4,Anesthesia,Afar,Single,30.0,343,Medicine
2,R/1737/06,Male,1,Public Administration,Afar,Single,29.0,435,Business and Economics
3,R/0268/06,Male,2,Construction Engineering,Afar,Single,28.0,385,Institute of Technology
4,R/0400/06,Male,2,Construction Engineering,Afar,Single,28.0,371,Institute of Technology


#### Selecting a column

In [None]:
# Single column
df['column_name']

# Multiple columns
df[['column1', 'column2']]
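
The number of brackets decides what comes back. A minimal sketch with a made-up table (`df_cols` is a hypothetical name):

```python
import pandas as pd

# Hypothetical rows using column names from the tutorial's dataset
df_cols = pd.DataFrame(
    {
        "dept": ["Anesthesia", "Surveying Engineering"],
        "age": [30.0, 27.0],
        "g_12": [343, 390],
    }
)

one_col = df_cols["age"]              # single brackets: a Series
one_col_frame = df_cols[["age"]]      # list with one name: a one-column DataFrame
two_cols = df_cols[["dept", "g_12"]]  # list of names: a DataFrame

print(type(one_col).__name__)        # Series
print(type(one_col_frame).__name__)  # DataFrame
print(list(two_cols.columns))        # ['dept', 'g_12']
```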

#### Selecting a row

In [None]:
# By position
df.iloc[0]  # First row
# By index label
df.loc[0]   # Row with index label 0
# Select the rows that satisfy a condition, keeping all columns
df[df['column_name'] > 10]
# Select the rows with values greater than 10 and less than 20, keeping all columns
df[(df['column_name'] > 10) & (df['column_name'] < 20)]
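
`iloc[0]` and `loc[0]` return the same row only when the index happens to be 0, 1, 2, and so on. The sketch below uses a made-up table with a shuffled index to show the difference, and adds `Series.between` as one alternative way to write the range condition.

```python
import pandas as pd

# Hypothetical data: the index labels are deliberately out of order
df_rows = pd.DataFrame({"age": [30, 27, 15, 8]}, index=[2, 0, 3, 1])

first_by_position = df_rows.iloc[0]  # first row, wherever its label is
row_with_label_0 = df_rows.loc[0]    # the row whose index label is 0

print(first_by_position["age"])  # 30
print(row_with_label_0["age"])   # 27

# Combine conditions with & and wrap each one in parentheses
in_range = df_rows[(df_rows["age"] > 10) & (df_rows["age"] < 20)]

# Equivalent filter; inclusive="neither" excludes both endpoints
in_range_alt = df_rows[df_rows["age"].between(10, 20, inclusive="neither")]
print(in_range["age"].tolist())  # [15]
```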

#### Selecting specific cells

In [None]:
# By row position and column position
df.iloc[0, 1]  # First row, second column

# By integer row label and column label
df.loc[0, 'column_name'] # Row with index label 0, column named 'column_name'

# By string row label and column label
df.loc['row_label', 'column_name']  # Row named 'row_label', column named 'column_name'
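
A runnable sketch of cell access, assuming the student IDs are used as row labels (the tutorial's data keeps them in a regular column instead). It also shows `.at` and `.iat`, the accessors pandas provides for reading a single value.

```python
import pandas as pd

# Two rows taken from the sample output, re-indexed by student ID (assumption)
grades = pd.DataFrame(
    {"age": [30.0, 28.0], "g_12": [335, 385]},
    index=["R/2791/06", "R/0268/06"],
)

by_position = grades.iloc[0, 1]             # first row, second column
by_label = grades.loc["R/0268/06", "age"]   # row label, column label
print(by_position)  # 335
print(by_label)     # 28.0

# .at (labels) and .iat (positions) return exactly one scalar
print(grades.at["R/2791/06", "g_12"])  # 335
print(grades.iat[1, 0])                # 28.0
```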

Assume that we are interested in the "g_12", "age", "dept", and "gender" columns for the rows where "gender" is "Male".
- To extract this subset of the "data" DataFrame, we use the label-based indexer, ```.loc```
- The first item inside the selection brackets, ```[]```, before the comma, is the row condition
- The list of columns after the comma picks the columns of interest

In [30]:
new_data = data.loc[data['gender']=='Male', ['g_12','age','dept','gender']]

In [31]:
new_data.head()

Unnamed: 0,g_12,age,dept,gender
0,335,30.0,Political Science,Male
1,343,30.0,Anesthesia,Male
2,435,29.0,Public Administration,Male
3,385,28.0,Construction Engineering,Male
4,371,28.0,Construction Engineering,Male
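
The row condition is itself a boolean Series, which can be stored in a variable before it is passed to `.loc`. A small sketch with made-up rows (`sample`, `is_male`, and `males` are hypothetical names):

```python
import pandas as pd

# Hypothetical rows using column names from the tutorial's dataset
sample = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male"],
        "dept": ["Political Science", "Anesthesia", "Construction Engineering"],
        "age": [30.0, 26.0, 28.0],
    }
)

is_male = sample["gender"] == "Male"  # one True/False value per row
print(is_male.tolist())               # [True, False, True]

# Rows where the mask is True, restricted to the listed columns
males = sample.loc[is_male, ["dept", "age"]]
print(males.index.tolist())  # [0, 2]: the original row labels are kept
```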


We also use index location (```.iloc```) to extract specific rows and columns from a given DataFrame by integer position.

In [39]:
specific_row_column = new_data.iloc[10:50,1:5] # Row positions 10-49; the column slice starts at position 1 and stops at the last column

In [38]:
specific_row_column.head()

Unnamed: 0,age,dept,gender
11,27.0,Industrial Engineering,Male
12,27.0,Textile Engineering,Male
13,27.0,Industrial Engineering,Male
16,27.0,Construction Engineering,Male
17,27.0,Surveying Engineering,Male
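
The index labels in the output above are not consecutive because the rows were filtered before slicing: `.iloc` counts positions, while the labels still come from the original table. A sketch with made-up rows shows the effect, with `reset_index(drop=True)` as one way to renumber the result.

```python
import pandas as pd

# Hypothetical rows; the names below are not part of the tutorial's dataset
sample = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
        "age": [30.0, 26.0, 28.0, 27.0, 25.0, 27.0],
    }
)

males = sample.loc[sample["gender"] == "Male"]
print(males.index.tolist())  # [0, 2, 3, 5]: gaps where rows were filtered out

# Positions 1 and 2 of the filtered frame carry the labels 2 and 3
subset = males.iloc[1:3]
print(subset.index.tolist())  # [2, 3]

# Discard the old labels and number the rows from 0 again
renumbered = subset.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```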
