# Pandas

The pandas library is useful for working with ***structured data***.<br>

What is structured data? <br>
Data stored in tabular form, such as CSV files, Excel spreadsheets, or database tables, is structured.<br>

Unstructured data consists of free-form text, images, sound, or video.<br>

If you are working with structured data, pandas will be a great utility for you.


## Importing Pandas

Most users of the pandas library use an import alias so they can refer to it as **pd**.

```python
import pandas as pd
```

# Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, floats, strings, Python objects, etc.). It is essentially a single column of a spreadsheet, or a one-dimensional NumPy array with additional functionality.<br>

Key Characteristics:
* One-dimensional: Data is arranged in a single column.
* Labeled: Each element has an associated label (index).
* Size-immutable: values can be changed in place, but operations that add or remove elements generally return a new Series.
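
The characteristics above can be illustrated with a small sketch. The weekday labels and temperature values are made up for illustration and are not part of the course material.

```python
import pandas as pd

# Hypothetical data: three daily temperatures labelled by weekday.
temps = pd.Series([21.5, 23.0, 19.8], index=['mon', 'tue', 'wed'])

# Labeled: a value can be looked up by its index label.
tuesday = temps['tue']

# One-dimensional: the shape tuple has a single entry.
shape = temps.shape

# Existing values can be overwritten in place through their label.
temps['wed'] = 20.1

print(temps)
print(tuesday, shape)
```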

NOTE: A Series can hold any data type, but a single dtype applies to the whole Series rather than to each value individually. Values of mixed types are stored under the general-purpose `object` dtype.
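
As a rough sketch of what this note means, the example below builds three Series and inspects their `dtype`. The sample values are arbitrary, and the exact integer dtype can vary by platform.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])           # every value is an integer
floats = pd.Series([1.5, 2.5, 3.5])   # every value is a float
mixed = pd.Series([1, 'two', 3.0])    # mixed Python types

print(ints.dtype)    # an integer dtype (int64 on most platforms)
print(floats.dtype)  # float64
print(mixed.dtype)   # object: one dtype for the whole Series
```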

![class 5](series_anatomy.png)
Image Source - Pandas Cookbook

## Creating a Series 

```python
import numpy as np

# From a list
data = ['a', 'e', 'i', 'o', 'u']
s = pd.Series(data)
print(f'Series from a list:\n{s}')

# From a NumPy array
data = np.array([1, 2, 3, 4, 5])
s = pd.Series(data)
print(f'Series from numpy array:\n{s}')

# From a dictionary (the keys become the index labels)
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(f'Series from a dictionary:\n{s}')
```

```
Series from a list:
0    a
1    e
2    i
3    o
4    u
dtype: object
Series from numpy array:
0    1
1    2
2    3
3    4
4    5
dtype: int64
Series from a dictionary:
a    1
b    2
c    3
dtype: int64
```


NOTE: More Series methods and operations will be covered later in the course.

# DataFrame

## Introduction to DataFrames

* A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.<br>
* In simple terms, a DataFrame is a table of data with rows and columns, where each column is a Series.
* Visually, a DataFrame looks like a table consisting of *rows* and *columns*.<br>
* Beneath the surface are three components: the *`index`*, the *`columns`*, and the *`data`*.
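
A minimal sketch of how the three components can be inspected, assuming a tiny made-up DataFrame with hypothetical row labels `r1` and `r2`:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
    index=['r1', 'r2'],  # hypothetical row labels, chosen for illustration
)

print(df.index)       # the index (row labels)
print(df.columns)     # the columns (column names)
print(df.to_numpy())  # the data, as a NumPy array

# Selecting a single column returns a Series.
age = df['Age']
print(type(age))
```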

![class 6](anatomy_dataframe.png)

*`Index labels`* and *`column names`* refer to the individual members of the index and the columns, respectively.<br>
`Index` refers to all the index labels as a whole, and `columns` refers to all the column names as a whole.

The labels in the index and the column names allow data to be pulled out by index label and column name. The index is also used for *alignment*: when multiple Series or DataFrames are combined, their indexes are aligned first, before any calculation occurs.
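
Index alignment can be sketched with two small Series whose labels only partly overlap. The labels and values here are invented for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label, not by position.
# Labels present in only one Series produce NaN in the result.
total = a + b
print(total)
```

Only `y` and `z` exist in both Series, so only those two labels get a numeric sum.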

Collectively, the columns and the index are known as the axes.<br>
**Index - Axis 0**<br>
**Columns - Axis 1**
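
Many DataFrame methods take an `axis` argument that refers to these two axes. As a small sketch with made-up numbers, `sum` can be applied along either one:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 works down the index, giving one result per column.
col_totals = df.sum(axis=0)

# axis=1 works across the columns, giving one result per row.
row_totals = df.sum(axis=1)

print(col_totals)
print(row_totals)
```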

Pandas uses **NaN** (Not a Number) *to represent missing values (including missing string values)*.
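
One way to see a missing string value, sketched here with made-up city names, is to reindex a Series to a label it does not contain. `isna()` then flags the missing entry.

```python
import pandas as pd

cities = pd.Series(['Paris', 'Lima'], index=['a', 'b'])

# Label 'c' has no value in the original Series, so it is filled as missing.
extended = cities.reindex(['a', 'b', 'c'])
print(extended)

# isna() marks which entries are missing.
missing = extended.isna()
print(missing)
```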

The three consecutive dots, `...`, indicate that at least one column exists but is not shown because of the display limit.
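
The display limit is controlled by the `display.max_columns` option. The sketch below uses a hypothetical ten-column DataFrame and temporarily lowers the limit so that the `...` placeholder appears in the printed output.

```python
import pandas as pd

# A hypothetical wide DataFrame with ten columns named c0 to c9.
wide = pd.DataFrame([list(range(10))], columns=[f'c{i}' for i in range(10)])

# Temporarily limit the display to 4 columns; the setting is restored
# automatically when the with-block ends.
with pd.option_context('display.max_columns', 4):
    truncated = repr(wide)

print(truncated)
```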

### Creating DataFrames

There are multiple ways to create a DataFrame using the `pd.DataFrame()` constructor.

```python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(f'DataFrame using Dict:\n{df}')

# Creating a DataFrame from a NumPy array
# (a NumPy array has a single dtype, so the ages are stored as strings here)
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(f'DataFrame using Numpy array:\n{df}')

# Creating a DataFrame from a list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(f'DataFrame using Lists of List:\n{df}')
```

```
DataFrame using Dict:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
DataFrame using Numpy array:
      Name Age         City
0    Alice  25     New York
1      Bob  30  Los Angeles
2  Charlie  35      Chicago
DataFrame using Lists of List:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```
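
In the NumPy-array output the `Age` column is left-aligned differently from the other two outputs. This is because a NumPy array holds a single dtype, so mixing names and numbers turns every value into a string. The sketch below checks this on a shortened version of the same data and converts the column back to integers with `astype`. The conversion step is an addition here, not something the original cell does.

```python
import numpy as np
import pandas as pd

data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
])
print(data.dtype)  # a NumPy string dtype: the ages became text

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
age_is_numeric_before = pd.api.types.is_numeric_dtype(df['Age'])

# Convert the text ages back to integers.
df['Age'] = df['Age'].astype('int64')
age_is_numeric_after = pd.api.types.is_numeric_dtype(df['Age'])

print(age_is_numeric_before, age_is_numeric_after)
```

Building the DataFrame from a dictionary or a list of lists avoids the issue, because each column's type is then inferred separately.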


### DataFrame Attributes

DataFrame attributes provide metadata and basic information about the DataFrame.<br>
`df.shape` - Returns a tuple `(rows, columns)` giving the dimensions of the DataFrame.<br>
`df.columns` - Returns an Index object containing the column labels.<br>
`df.index` - Returns an Index object containing the row labels.<br>
`df.dtypes` - Returns the data type of each column.<br>
`df.size` - Returns the total number of elements in the DataFrame (rows × columns).<br>

```python
data = {
    'Region': ['Europe', 'North America', 'Asia', 'Africa', 'South America'],
    'No. of Tourists': [30000000, 25000000, 45000000, 15000000, 22000000],
    'Average Temperature (F)': [55, 65, 75, 85, 70]
}
df = pd.DataFrame(data)
print('DataFrame:\n', df)
print()
print(f'Shape of Data Frame: {df.shape}')
print(f'Columns of Data Frame: {df.columns}')
print(f'Index of Data Frame: {df.index}')
print(f'Data-types of columns of Data Frame: {df.dtypes}')
print(f'Size of Data Frame: {df.size}')
```

```
DataFrame:
           Region  No. of Tourists  Average Temperature (F)
0         Europe         30000000                       55
1  North America         25000000                       65
2           Asia         45000000                       75
3         Africa         15000000                       85
4  South America         22000000                       70

Shape of Data Frame: (5, 3)
Columns of Data Frame: Index(['Region', 'No. of Tourists', 'Average Temperature (F)'], dtype='object')
Index of Data Frame: RangeIndex(start=0, stop=5, step=1)
Data-types of columns of Data Frame: Region                     object
No. of Tourists             int64
Average Temperature (F)     int64
dtype: object
Size of Data Frame: 15
```


### DataFrame Methods
`head()`: Shows the first n rows of the DataFrame (5 by default).<br>
`tail()`: Shows the last n rows of the DataFrame (5 by default).<br>
`info()`: Provides a summary of the DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.<br>
`describe()`: Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.<br>
`drop()`: Drops specified labels from rows or columns.<br>
`value_counts()`: Returns a Series containing counts of unique values.
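
A short sketch of these methods in use. The region names and tourist figures are hypothetical sample data, chosen so that some regions repeat.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['Europe', 'Asia', 'Asia', 'Africa', 'Europe', 'Asia'],
    'Tourists': [30, 45, 40, 15, 28, 50],  # hypothetical, in millions
})

first_two = df.head(2)   # first 2 rows
last_two = df.tail(2)    # last 2 rows

df.info()                # prints a summary: dtypes, non-null counts, memory

stats = df.describe()    # count, mean, std, min, quartiles, max (numeric columns)

# drop() returns a new DataFrame; df itself is left unchanged.
without_region = df.drop(columns=['Region'])

# value_counts() counts how often each unique value occurs.
counts = df['Region'].value_counts()

print(first_two, last_two, stats, without_region, counts, sep='\n\n')
```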


# Querying Rows and Columns

# Operations Between Columns

# IO Operations