# Pandas

- Pandas is a Python package designed to make working with structured (tabular) data fast and flexible.

- It provides two main data structures: **Series** (one-dimensional, like a single column) and **DataFrame** (two-dimensional, like a table of rows and columns).

- Each column in a DataFrame holds only one data type (homogeneous), but different columns can have different types (e.g., numbers and text).
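A minimal sketch of the two structures follows. The names `score` and `label` and all values are made up for illustration. It shows that selecting one column of a DataFrame yields a Series, and that each column reports its own dtype.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (hypothetical example data)
s = pd.Series([10, 20, 30], name="score")

# A DataFrame: two-dimensional, with one dtype per column
df = pd.DataFrame({"score": [10, 20, 30], "label": ["a", "b", "c"]})

# Selecting a single column returns a Series
col = df["score"]
print(type(col).__name__)  # Series

# dtypes lists the data type of every column
print(df.dtypes)
```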

### Why Pandas?
- While Python lists and dictionaries can hold any type of data, they're not optimized for tabular or column-based operations and analysis.

- NumPy arrays are highly efficient for numeric calculations, but a regular array holds a single data type and has no row or column labels.

- *pandas* DataFrames bridge the gap: every column is labeled and holds values of a single type, while the table as a whole can mix types and handle large, complex datasets efficiently.
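The contrast with NumPy can be sketched as follows. This is an illustrative comparison with made-up values, not a benchmark: a mixed-type list passed to `np.array` is coerced to one common type, while a DataFrame keeps a separate dtype per column.

```python
import numpy as np
import pandas as pd

# NumPy stores one dtype per array, so mixed input is coerced to a common type
arr = np.array([1, 2, "x"])
print(arr.dtype.kind)  # U (every element became a Unicode string)

# A DataFrame keeps a separate dtype for each labeled column
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
print(pd.api.types.is_integer_dtype(df["A"]))  # True
print(pd.api.types.is_numeric_dtype(df["B"]))  # False
```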

In [11]:
# Creating dataframe from a dictionary

import pandas as pd

data = {"A": [1, 2, 3], "B": ['x', 'y', 'z']}
pd.DataFrame(data)

# Every DataFrame has an index—a label for each row.
# By default, the index is numeric (0, 1, 2, ...).

   A  B
0  1  x
1  2  y
2  3  z


In [12]:
# Setting index in dataframes

# Pass custom row labels through the index argument when creating the DataFrame
pd.DataFrame(data, index=["row0", "row1", "row2"])

      A  B
row0  1  x
row1  2  y
row2  3  z


In [16]:
# Set an index from a column after creation

df = pd.DataFrame(data)
df.set_index("A", inplace=True)
df

   B
A   
1  x
2  y
3  z
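Once a column is the index, rows can be looked up by label. The sketch below reuses the same small dictionary. The variable names `indexed`, `value`, and `restored` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# set_index returns a new DataFrame unless inplace=True is passed
indexed = df.set_index("A")

# Label-based lookup: the row whose index label is 2
value = indexed.loc[2, "B"]
print(value)  # y

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))  # ['A', 'B']
```

Assigning the result of `set_index` to a new name keeps the original `df` unchanged, which can make notebook cells safer to re-run.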


### Creating DataFrames from Files and URLs
*pandas* supports reading from many sources:

- CSV: pd.read_csv('filename.csv')
- Excel: pd.read_excel('filename.xlsx')
- URLs: pd.read_csv('http://website.com/data.csv') reads a remote file directly.
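A self-contained way to try the CSV reader without an existing file is to write a small DataFrame to a temporary file and read it back. The file name `example.csv` and the data are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")  # hypothetical file name
    df.to_csv(path, index=False)  # index=False: do not write row labels as a column
    loaded = pd.read_csv(path)    # column types are inferred from the file contents

print(loaded.shape)  # (3, 2)
```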

### Common DataFrame Methods
- **.info()**: Concise summary of the DataFrame, including index, column types, and non-null counts.
- **.describe()**: Summary statistics for numerical columns (count, mean, std, min, quartiles, max).
- **.head(n)**: Returns the first n rows (default 5), useful for quickly inspecting data.
- **.tail(n)**: Returns the last n rows (default 5).

In [18]:
import pandas as pd

data = {"Name": ["Alice", "Bob", "Carol"], "Age": [25, 32, 47]}
df = pd.DataFrame(data)
# info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()                   # Summary: columns, datatypes, non-null counts
print()
print(df.describe(), "\n")  # Descriptive statistics for numeric columns
print(df.head(2), "\n")     # First two rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 180.0+ bytes

             Age
count   3.000000
mean   34.666667
std    11.239810
min    25.000000
25%    28.500000
50%    32.000000
75%    39.500000
max    47.000000 

    Name  Age
0  Alice   25
1    Bob   32 
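The remaining method from the list, `.tail(n)`, can be sketched on the same three-row table. The `include="all"` argument to `describe()` is an addition not used in the cell above: it asks for non-numeric columns to be summarized as well.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Carol"], "Age": [25, 32, 47]})

# Last two rows; the original index labels are kept
last_two = df.tail(2)
print(last_two.index.tolist())  # [1, 2]

# describe() skips non-numeric columns by default; include="all" adds them
summary = df.describe(include="all")
print(list(summary.columns))  # ['Name', 'Age']
```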



In [34]:
# Loading data from external sources
# Load data on educational attainment and personal income from a CSV file downloaded from the
# [California Open Data](https://data.ca.gov/dataset/ca-educational-attainment-personal-income/resource/26201f19-4469-4311-a819-bbbd3e557eda) portal.

table = pd.read_csv("test-csv-file.csv")

# info() prints its summary directly and returns None, so it is not wrapped in print()
table.info()                   # Summary: columns, datatypes, non-null counts
print()
print(table.describe(), "\n")  # Descriptive statistics for numeric columns
print(table.head(10), "\n")    # First ten rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Year                    24 non-null     object 
 1   Age                     24 non-null     object 
 2   Gender                  24 non-null     object 
 3   Educational Attainment  24 non-null     object 
 4   Personal Income         24 non-null     object 
 5   Population Count        20 non-null     float64
dtypes: float64(1), object(5)
memory usage: 1.3+ KB

       Population Count
count         20.000000
mean      107540.750000
std       204389.950046
min         1304.000000
25%         3250.250000
50%         7345.000000
75%        65717.250000
max       650889.000000 

                     Year       Age Gender  \
0  01/01/2008 12:00:00 AM  00 to 17   Male   
1  01/01/2008 12:00:00 AM  00 to 17   Male   
2  01/01/2008 12:00:00 AM  00 to 17   Male   
3  01/01/2008
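The `info()` summary above reports `Year` as `object` (plain text) and `Population Count` with only 20 non-null values out of 24. One possible follow-up, sketched here on made-up rows that only mimic the column names and the timestamp format shown above, is to parse the dates and count the missing values.

```python
import pandas as pd

# Made-up rows; only the column names and timestamp format follow the output above
sample = pd.DataFrame({
    "Year": ["01/01/2008 12:00:00 AM", "01/01/2009 12:00:00 AM"],
    "Population Count": [1000.0, None],
})

# Parse the text timestamps into datetime values using an explicit format
sample["Year"] = pd.to_datetime(sample["Year"], format="%m/%d/%Y %I:%M:%S %p")
print(sample["Year"].dt.year.tolist())  # [2008, 2009]

# Count missing values per column
missing = sample.isna().sum()
print(missing["Population Count"])  # 1
```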

In [32]:
# Load data from a CSV URL
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"

# Load the CSV file into a pandas DataFrame
df = pd.read_csv(url)

# Display the first five rows
print(df.head())

  Month   "1958"   "1959"   "1960"
0   JAN      340      360      417
1   FEB      318      342      391
2   MAR      362      406      419
3   APR      348      396      461
4   MAY      363      420      472
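The quotation marks in the column names above suggest that the file has a space after each comma, so the quotes are read as part of the name. A possible fix is the `skipinitialspace=True` option of `pd.read_csv`. The sketch below uses an inline string as an assumed stand-in for the remote file, so it runs without network access.

```python
import io

import pandas as pd

# Inline stand-in for the remote file: note the space after each comma
text = (
    '"Month", "1958", "1959", "1960"\n'
    '"JAN",  340,  360,  417\n'
    '"FEB",  318,  342,  391\n'
)

# Default parsing: names after the first keep the leading space and the quotes
raw = pd.read_csv(io.StringIO(text))
print(list(raw.columns))

# Skipping whitespace after each delimiter lets the quotes be parsed normally
clean = pd.read_csv(io.StringIO(text), skipinitialspace=True)
print(list(clean.columns))  # ['Month', '1958', '1959', '1960']
```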


### Why pandas is Essential
*pandas* makes real-world data analysis, cleaning, and exploration practical and efficient for scientists, analysts, and engineers.

Its integration with NumPy and other data science tools makes it a foundational part of the Python data ecosystem.
