# PANDAS
- **PANDAS** is designed to work with tabular or heterogeneous data
- **NumPy** is best suited to homogeneously typed numerical array data

**Import Conventions**
- import numpy as np
- import pandas as pd

_______________________________________________________________


**Pandas Data Structures**
- Series
    * 1-D array-like object containing a sequence of values of the same type
    * It also contains an associated array of data labels ---> Index

- DataFrame
    * It represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.)
    * It has both a row index and a column index
    * It can be thought of as a dictionary of Series all sharing the same index
    * Possible data inputs to the DataFrame constructor
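
As a rough illustration of the last two bullets, the sketch below builds the same small table from three common inputs to the DataFrame constructor. The column names, labels, and values are invented for this example and do not come from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this illustration only.
idx = ["a", "b", "c"]

# 1) dict of equal-length lists: keys become column names
from_lists = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# 2) dict of Series: the Series are aligned on their shared index
from_series = pd.DataFrame({
    "x": pd.Series([1, 2, 3], index=idx),
    "y": pd.Series([4.0, 5.0, 6.0], index=idx),
})

# 3) 2-D NumPy array: row and column labels are passed explicitly
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), index=idx, columns=["x", "y"])

print(from_lists.dtypes)   # each column keeps its own dtype
print(from_series)
print(from_array)
```

Other inputs exist as well (for example a list of dictionaries); these three are only the ones that map most directly onto the description above.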

___________________________

**Index Objects**
- An Index object holds the axis labels and other metadata (such as the axis name)
- Index methods and properties: https://wesmckinney.com/book/pandas-basics#tbl-table_index_methods
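
A minimal sketch of how an Index behaves, assuming a small made-up set of labels: it cannot be modified item by item, but it supports membership tests and set-like operations.

```python
import pandas as pd

# Hypothetical labels, used only for this illustration.
labels = pd.Index(["a", "b", "c"])
ser = pd.Series([1, 2, 3], index=labels)

# Index objects are immutable: item assignment raises TypeError
try:
    labels[0] = "z"
    mutated = True
except TypeError:
    mutated = False

# Membership tests and set-like operations work on the labels
has_b = "b" in labels
combined = labels.union(pd.Index(["c", "d"]))

print(mutated, has_b, list(combined))
```

Immutability is what makes it safe to share one Index between several Series or DataFrame columns.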


In [3]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

objeck = pd.Series([1, 2, 3, 4, 5])

print(objeck)

print('\nConvert series to array')
print(objeck.array)

0    1
1    2
2    3
3    4
4    5
dtype: int64

Convert series to array
<NumpyExtensionArray>
[1, 2, 3, 4, 5]
Length: 5, dtype: int64


In [5]:
# assign custom index labels to the Series, then look values up by label
objeck2 = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
print(objeck2)
print(objeck2['e'])
print(objeck2['c'])
print(objeck2['a'])

a    1
b    2
c    3
d    4
e    5
dtype: int64
5
3
1


In [10]:
# element-wise multiplication, boolean filtering, and a NumPy function
darab = objeck2 * objeck2
compare = objeck2[objeck2 > 3]
expo = np.exp(objeck2)

print(darab)
print(compare)
print(expo)

a     1
b     4
c     9
d    16
e    25
dtype: int64
d    4
e    5
dtype: int64
a      2.718282
b      7.389056
c     20.085537
d     54.598150
e    148.413159
dtype: float64
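
The filtering step in the cell above can be unpacked into two parts. This is a small sketch with the same values; the names s, mask, and selected are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

mask = s > 3          # element-wise comparison gives a boolean Series
selected = s[mask]    # keeps only the entries where mask is True

print(mask)
print(selected)
```

The index labels travel with the values through each operation, which is why the filtered result still shows d and e.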


**Create Series from Dictionaries**

In [12]:
state_pop_data = {"Johor": 4125000,
    "Kedah": 2706000,
    "Kelantan": 1917000,
    "Melaka": 960000,
    "Negeri Sembilan": 1248000,
    "Pahang": 1752000,
    "Penang": 1823000,
    "Perak": 2513000,
    "Perlis": 259000,
    "Sabah": 4336000,
    "Sarawak": 2842000,
    "Selangor": 6698000,
    "Terengganu": 1252000,}

objeck3 = pd.Series(state_pop_data)
objeck3

Johor              4125000
Kedah              2706000
Kelantan           1917000
Melaka              960000
Negeri Sembilan    1248000
Pahang             1752000
Penang             1823000
Perak              2513000
Perlis              259000
Sabah              4336000
Sarawak            2842000
Selangor           6698000
Terengganu         1252000
dtype: int64

**Convert a Series back to a dictionary**

In [13]:
objeck3.to_dict()

{'Johor': 4125000,
 'Kedah': 2706000,
 'Kelantan': 1917000,
 'Melaka': 960000,
 'Negeri Sembilan': 1248000,
 'Pahang': 1752000,
 'Penang': 1823000,
 'Perak': 2513000,
 'Perlis': 259000,
 'Sabah': 4336000,
 'Sarawak': 2842000,
 'Selangor': 6698000,
 'Terengganu': 1252000}

In [19]:
states = ["Johor","Kedah","Kelantan","Melaka","Negeri Sembilan","Pahang",
    "Penang","Perak","Perlis","Sabah","Sarawak","Selangor","Terengganu", 
    "Kuala Lumpur", "Labuan", "Putrajaya"]

objeck4 = pd.Series(state_pop_data, index=states)

print(objeck4)

Johor              4125000.0
Kedah              2706000.0
Kelantan           1917000.0
Melaka              960000.0
Negeri Sembilan    1248000.0
Pahang             1752000.0
Penang             1823000.0
Perak              2513000.0
Perlis              259000.0
Sabah              4336000.0
Sarawak            2842000.0
Selangor           6698000.0
Terengganu         1252000.0
Kuala Lumpur             NaN
Labuan                   NaN
Putrajaya                NaN
dtype: float64


**Note on the example above:**

- A separate index list is passed alongside the dictionary when creating the new Series
- The list contains additional states for which the dictionary has no population figure
- The new Series holds NaN for those states, and the dtype changes from int64 to float64 because NaN is a floating-point value
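
To find which labels ended up without a value, isna() and dropna() can be used. The sketch below is a reduced version of the example: two population figures are copied from the dictionary above, and the names pop, wanted, missing, and known are chosen here for illustration.

```python
import pandas as pd

pop = {"Johor": 4125000, "Perlis": 259000}   # two entries copied from the example above
wanted = ["Johor", "Perlis", "Labuan"]       # "Labuan" has no entry in pop

s = pd.Series(pop, index=wanted)

missing = s[s.isna()].index.tolist()   # labels that have no value
known = s.dropna().astype("int64")     # drop NaN, then restore an integer dtype

print(s.dtype, missing)
print(known)
```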

**Assign names to Series objects**

In [20]:
objeck4.name = 'Population'
objeck4.index.name = "Negeri"
objeck4

Negeri
Johor              4125000.0
Kedah              2706000.0
Kelantan           1917000.0
Melaka              960000.0
Negeri Sembilan    1248000.0
Pahang             1752000.0
Penang             1823000.0
Perak              2513000.0
Perlis              259000.0
Sabah              4336000.0
Sarawak            2842000.0
Selangor           6698000.0
Terengganu         1252000.0
Kuala Lumpur             NaN
Labuan                   NaN
Putrajaya                NaN
Name: Population, dtype: float64

**A Series's index can be altered in place by assignment:**

In [24]:

objeck2.index = ["blue", "yellow", "red", "pink", "grey"]
print(objeck2)

blue      1
yellow    2
red       3
pink      4
grey      5
dtype: int64


# DataFrame

- We reuse the data from the Series tutorial
- We create separate lists of the states and their populations from the dictionary
- We add another list to the DataFrame: the year each state was founded

In [30]:
print(state_pop_data)

newStates = list(state_pop_data.keys())
population = list(state_pop_data.values())

print(f'\nNegeri:\n{newStates}\n')
print(f'Jumlah Penduduk:\n {population}\n')

newData = {
    "states": newStates, 
    "population": population, 
    "Year_founded": [1855, 1136, 1820, 1400, 1773, 1881, 1786, 1820, 1843, 1881, 1841, 1766, 1303]
}

frame = pd.DataFrame(newData)
frame

{'Johor': 4125000, 'Kedah': 2706000, 'Kelantan': 1917000, 'Melaka': 960000, 'Negeri Sembilan': 1248000, 'Pahang': 1752000, 'Penang': 1823000, 'Perak': 2513000, 'Perlis': 259000, 'Sabah': 4336000, 'Sarawak': 2842000, 'Selangor': 6698000, 'Terengganu': 1252000}

Negeri:
['Johor', 'Kedah', 'Kelantan', 'Melaka', 'Negeri Sembilan', 'Pahang', 'Penang', 'Perak', 'Perlis', 'Sabah', 'Sarawak', 'Selangor', 'Terengganu']

Jumlah Penduduk:
 [4125000, 2706000, 1917000, 960000, 1248000, 1752000, 1823000, 2513000, 259000, 4336000, 2842000, 6698000, 1252000]



            states  population  Year_founded
0            Johor     4125000          1855
1            Kedah     2706000          1136
2         Kelantan     1917000          1820
3           Melaka      960000          1400
4  Negeri Sembilan     1248000          1773
5           Pahang     1752000          1881
6           Penang     1823000          1786
7            Perak     2513000          1820
8           Perlis      259000          1843
9            Sabah     4336000          1881


**Head method**
- head() returns the first 5 rows by default; pass a number to change how many

**Tail method**
- tail() returns the last 5 rows by default; pass a number to change how many

In [31]:
frame.head()

            states  population  Year_founded
0            Johor     4125000          1855
1            Kedah     2706000          1136
2         Kelantan     1917000          1820
3           Melaka      960000          1400
4  Negeri Sembilan     1248000          1773


In [32]:
frame.tail()

        states  population  Year_founded
8       Perlis      259000          1843
9        Sabah     4336000          1881
10     Sarawak     2842000          1841
11    Selangor     6698000          1766
12  Terengganu     1252000          1303
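
Both methods accept an optional row count. A minimal sketch, using a throwaway 13-row frame (the name df and the column n are made up here) rather than the states data:

```python
import pandas as pd

df = pd.DataFrame({"n": range(13)})   # 13 rows, the same length as the frame above

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 11, 12

print(first_three.index.tolist(), last_two.index.tolist())
```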


**SPECIFY SEQUENCE OF COLUMNS**


In [40]:
print(frame)

newSequence = pd.DataFrame(newData, columns = ['Year_founded', 'states', 'population', "City"])
print(newSequence)

             states  population  Year_founded
0             Johor     4125000          1855
1             Kedah     2706000          1136
2          Kelantan     1917000          1820
3            Melaka      960000          1400
4   Negeri Sembilan     1248000          1773
5            Pahang     1752000          1881
6            Penang     1823000          1786
7             Perak     2513000          1820
8            Perlis      259000          1843
9             Sabah     4336000          1881
10          Sarawak     2842000          1841
11         Selangor     6698000          1766
12       Terengganu     1252000          1303
    Year_founded           states  population City
0           1855            Johor     4125000  NaN
1           1136            Kedah     2706000  NaN
2           1820         Kelantan     1917000  NaN
3           1400           Melaka      960000  NaN
4           1773  Negeri Sembilan     1248000  NaN
5           1881           Pahang     1752000  NaN

**Retrieve series from data frame**

In [41]:
frame 

            states  population  Year_founded
0            Johor     4125000          1855
1            Kedah     2706000          1136
2         Kelantan     1917000          1820
3           Melaka      960000          1400
4  Negeri Sembilan     1248000          1773
5           Pahang     1752000          1881
6           Penang     1823000          1786
7            Perak     2513000          1820
8           Perlis      259000          1843
9            Sabah     4336000          1881


In [42]:
frame["states"]

0               Johor
1               Kedah
2            Kelantan
3              Melaka
4     Negeri Sembilan
5              Pahang
6              Penang
7               Perak
8              Perlis
9               Sabah
10            Sarawak
11           Selangor
12         Terengganu
Name: states, dtype: object

In [43]:
frame['Year_founded']

0     1855
1     1136
2     1820
3     1400
4     1773
5     1881
6     1786
7     1820
8     1843
9     1881
10    1841
11    1766
12    1303
Name: Year_founded, dtype: int64

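**loc**
- purpose: access data by label or boolean array
- label-based indexing
- slices are inclusive: the end label is part of the result

**iloc**
- purpose: access data by integer position
- position-based indexing
- slices are exclusive: the end position is left out, as in Python lists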

In [44]:
frame.loc[1]

states            Kedah
population      2706000
Year_founded       1136
Name: 1, dtype: object

In [57]:
iloc_example = pd.DataFrame({
    "x": [9, 8, 7],
    'y': [5, 4, 3]
}, index = ['a', 'b', 'c'])

print(iloc_example)

print('\nValue at second row, first column:')
print(iloc_example.iloc[1,0])

print('\nValue at first row, second column:')
print(iloc_example.iloc[0,1])


   x  y
a  9  5
b  8  4
c  7  3

Value at second row, first column:
8

Value at first row, second column:
5
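
The inclusive versus exclusive behaviour only shows up when slicing. The sketch below reuses the same small frame as the iloc example; the names df, by_label, by_position, and filtered are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [9, 8, 7], "y": [5, 4, 3]}, index=["a", "b", "c"])

by_label = df.loc["a":"b"]     # label slice: the end label "b" is included
by_position = df.iloc[0:1]     # position slice: the end position 1 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())

# loc also accepts a boolean array or Series to pick rows
filtered = df.loc[df["x"] > 7, "y"]
print(filtered.tolist())
```

So a slice that looks the same on paper returns two rows with loc but only one with iloc.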
