# Pandas - Basics, Data Structures, Load Data

## Why & What is Pandas

* A column-oriented data analysis API (data is stored column-wise, one column per feature)
* Useful for handling and analyzing input data
* Many ML & DL frameworks support pandas data structures as input
* A simple way to work with data in an Excel-like tabular format

WHY

* Easy to use, and keeps code clean
* Comes out of the box with many handy tools for data manipulation and wrangling
* Works much like Excel, plus features of its own
* Handles large datasets (around 5 million rows or more), provided the data fits in memory

## Load Pandas

In [13]:
!pip install pandas



In [3]:
# import pandas
import pandas as pd
pd.__version__

'2.2.2'

## Pandas Data Structures

* DataFrame
    - Your familiar Excel tables
    - Relational data table (rows and columns are related)
    - Rows (observations) & named columns (one feature per observation)

* Series
    - A single column
    - Each row is labeled via an index
    - Multiple named Series form a DataFrame

NOTE: The DataFrame is a commonly used abstraction for data manipulation tasks, and similar implementations exist in Spark and R.
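To make the relationship between the two structures concrete, here is a minimal sketch. The names `temps`, `humidity`, `weather_df` and all values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# A Series: a single column of values, each row labeled by an index
temps = pd.Series([21.5, 19.0, 25.3], index=["mon", "tue", "wed"], name="temp_c")
humidity = pd.Series([40, 55, 35], index=["mon", "tue", "wed"], name="humidity")

# A DataFrame: several named Series that share the same index
weather_df = pd.DataFrame({"temp_c": temps, "humidity": humidity})

# Selecting one column of a DataFrame gives back a Series
col = weather_df["temp_c"]
print(type(col))
print(col.loc["tue"])  # 19.0
```

Here the index holds labels (`"mon"`, `"tue"`, ...) instead of the default `0, 1, 2, ...`, which is why `.loc["tue"]` can look a row up by name.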


In [18]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)

print(type(diabetes['data']))

<class 'pandas.core.frame.DataFrame'>


In [19]:
?load_diabetes

Signature: load_diabetes(*, return_X_y=False, as_frame=False, scaled=True)
Docstring:
Load and return the diabetes dataset (regression).

Samples total    442
Dimensionality   10
Features         real, -.2 < x < .2
Targets          integer 25 - 346

.. note::
   The meaning of each feature (i.e. `feature_names`) might be unclear
   (especially for `ltg`) as the documentation of the original dataset is
   not explicit. We provide information that seems correct in regard with
   the scientific literature in this field of research.

Read more in the :ref:`User Guide <diabetes_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.



In [21]:
diabetes.data.head() # dataframe example

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 |


In [24]:
diabetes.data.age # series example

0      0.038076
1     -0.001882
2      0.085299
3     -0.089063
4      0.005383
         ...   
437    0.041708
438   -0.005515
439    0.041708
440   -0.045472
441   -0.045472
Name: age, Length: 442, dtype: float64

## Load Data In Pandas

### From Series

In [25]:
# dataframe with 2 series Fruits & Quantity

fruits = pd.Series(['Apple', 'Orange', 'Mango', 'Banana'])
qnty = pd.Series([10, 20, 30, 50])

# create a dataframe
fruits_df = pd.DataFrame({'Fruits': fruits, 'Quantity': qnty})

In [26]:
# sanity check
type(fruits_df)

pandas.core.frame.DataFrame

In [27]:
# dataframe
fruits_df

| | Fruits | Quantity |
|---|---|---|
| 0 | Apple | 10 |
| 1 | Orange | 20 |
| 2 | Mango | 30 |
| 3 | Banana | 50 |
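One detail worth knowing when building a DataFrame from Series: pandas matches rows by index label, not by position. The sketch below uses made-up Series (`fruits_small`, `qnty_partial`) whose indexes deliberately differ, to show how a missing label turns into `NaN`.

```python
import pandas as pd

# Hypothetical Series for illustration; the indexes are intentionally different
fruits_small = pd.Series(["Apple", "Orange", "Mango"], index=[0, 1, 2])
qnty_partial = pd.Series([30, 10], index=[2, 0])  # no entry for label 1

# Rows are aligned on index labels, so 30 goes with Mango and 10 with Apple
aligned_df = pd.DataFrame({"Fruits": fruits_small, "Quantity": qnty_partial})
print(aligned_df)
```

Label `1` has no quantity, so that cell is filled with `NaN` and the `Quantity` column becomes floating point.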


### From Dict of Lists

In [28]:
# each key is a column name; each list holds that column's values

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

human_df1 = pd.DataFrame(data)
print(type(human_df1))

human_df1



<class 'pandas.core.frame.DataFrame'>


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


### From List of Dicts

In [8]:
# each dict is one row; its keys are the column names

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

human_df2 = pd.DataFrame(data)
print(type(human_df2))

human_df2

<class 'pandas.core.frame.DataFrame'>


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
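The dict-of-lists and list-of-dicts forms describe the same table, once by column and once by row. As a quick sanity check, this sketch builds both and compares them with `DataFrame.equals`. The variable names are chosen for illustration only.

```python
import pandas as pd

# Column-wise description: one list per column
by_column = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
}

# Row-wise description: one dict per row
by_row = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"},
]

df_from_columns = pd.DataFrame(by_column)
df_from_rows = pd.DataFrame(by_row)

# Same values, same column order, same dtypes
print(df_from_columns.equals(df_from_rows))
```

Which form to use mostly depends on how the data arrives: records from an API tend to be row-wise, while hand-written examples are often easier to read column-wise.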


### From List of Lists (less common)

In [9]:
# create a list of lists (a matrix) and define the column names afterwards
# each inner list is one row; each item in it is the value for one column

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]

human_df3 = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(type(human_df3))

human_df3

<class 'pandas.core.frame.DataFrame'>


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


### From List of Tuples (uncommon)

In [10]:
# create a list of tuples and add the column names afterwards, similar to the list of lists approach
data = [
    ('Alice', 25, 'New York'),
    ('Bob', 30, 'Los Angeles'),
    ('Charlie', 35, 'Chicago')
]

human_df4 = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(type(human_df4))

human_df4

<class 'pandas.core.frame.DataFrame'>


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


### From Numpy Array

In [11]:
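# very similar to the list of lists approach, but here we use a NumPy array
# NumPy arrays are memory-efficient for homogeneous numeric data and are handy
# for quickly creating data in scientific demonstrations
# caveat: an array holds a single dtype, so mixing strings and numbers (as below)
# stores every value as a string, including the 'Age' values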

import numpy as np

data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])

human_df5 = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(type(human_df5))

human_df5

<class 'pandas.core.frame.DataFrame'>


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |
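A NumPy array holds a single dtype, so the mixed array above is stored entirely as strings even though the table prints the same as before. The sketch below, using small made-up arrays (`mixed`, `numeric`), shows the effect, one way to convert the column back, and the purely numeric case where NumPy input works best.

```python
import numpy as np
import pandas as pd

# Mixed strings and numbers: NumPy coerces everything to a string dtype
mixed = np.array([["Alice", 25], ["Bob", 30]])
print(mixed.dtype.kind)  # 'U' means unicode string

mixed_df = pd.DataFrame(mixed, columns=["Name", "Age"])
print(type(mixed_df["Age"].iloc[0]))  # the ages are text, not integers

# Convert the column back to integers before doing arithmetic on it
mixed_df["Age"] = mixed_df["Age"].astype(int)
print(mixed_df["Age"].sum())  # 55

# Purely numeric data keeps its numeric dtype
numeric = np.array([[1.0, 2.0], [3.0, 4.0]])
numeric_df = pd.DataFrame(numeric, columns=["x", "y"])
print(numeric_df.dtypes)
```

For mixed-type tables, the dict or list based constructors shown earlier usually preserve per-column types without this extra conversion step.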


### From CSV / Excel

In [30]:
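# read_csv treats the first row as column names by default;
# if the file has no header row, pass header=None so the first data row is not used as the header
mnist_df = pd.read_csv('mnist_test.csv')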
print(type(mnist_df))

mnist_df.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,7,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,...,0.658,0.659,0.660,0.661,0.662,0.663,0.664,0.665,0.666,0.667
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
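The column names `7, 0, 0.1, 0.2, ...` in the output above suggest that this file has no header row, so `read_csv` used the first data row as column names and renamed the repeated `0` values. A minimal sketch of the difference, using a tiny made-up CSV string (`csv_text`, an assumption for illustration) in place of `mnist_test.csv`:

```python
import io
import pandas as pd

# Three data rows, no header row (stand-in for a headerless file)
csv_text = "7,0,0\n2,0,0\n1,0,0\n"

# Default behaviour: the first row is consumed as the column names
with_header = pd.read_csv(io.StringIO(csv_text))
print(list(with_header.columns))  # ['7', '0', '0.1']
print(with_header.shape)          # (2, 3)

# header=None keeps every row as data and numbers the columns 0, 1, 2, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.shape)            # (3, 3)
```

With `header=None` no row is lost, and explicit names can be supplied through the `names` parameter if needed.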


---

**NOTE**

These are the fundamental ones; there are many more methods for loading data, for example from JSON, TSV, Parquet, etc.

I highly recommend exploring the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) to learn more; you can always look online for help as well.
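As a small taste of those other loaders, here is a hedged sketch that reads TSV and JSON from in-memory strings, so it runs without any files. The sample text and variable names are made up for illustration. Parquet is left out because `pd.read_parquet` needs an optional engine such as `pyarrow` to be installed.

```python
import io
import pandas as pd

# TSV is read with read_csv by changing the separator to a tab
tsv_text = "Name\tAge\nAlice\t25\nBob\t30\n"
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# JSON as a list of records, one object per row
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'
json_df = pd.read_json(io.StringIO(json_text))

print(tsv_df)
print(json_df)
```

With real files, the `io.StringIO(...)` argument would simply be replaced by a file path.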