# Intro to Pandas: Series and DataFrame
Pandas is built on top of NumPy and provides higher-level data structures and data analysis tools, making it more suitable for working with structured or tabular data.

I reviewed:
- Series
- DataFrames
- Extras: indexes as sequences, sizes, and data types.

In [1]:
import pandas as pd
import numpy as np

## Series

`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

`s = pd.Series(data, index=index)`

Here, data can be many different things:

- a Python dict 
- an ndarray
- a scalar value (like 5)

In [7]:
# From ndarray
# You can place your own labels using index
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a   -0.055945
b    1.281219
c    0.669314
d    1.690967
e   -0.440082
dtype: float64

In [8]:
# from dict
# the labels are the keys of the dictionary
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [9]:
# From scalar value
# The value is repeated to match the number of labels in index
pd.Series(5.0, index=["a", "b", "c", "d", "e"])


a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

#### Series as ndarray-like and dict-like

A `Series` can behave like an ndarray or like a dict.

In [11]:
# as an ndarray
# boolean indexing
s[s > s.median()]

b    1.281219
d    1.690967
dtype: float64

In [12]:
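# positional indexing
# use .iloc for positions; plain s[[4, 3, 1]] on a labeled Series is deprecated in recent pandas
s.iloc[[4, 3, 1]]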

e   -0.440082
d    1.690967
b    1.281219
dtype: float64

In [13]:
# applying numpy functions
np.exp(s)

a    0.945591
b    3.601027
c    1.952898
d    5.424723
e    0.643984
dtype: float64

In [14]:
# to get the underlying NumPy array
s.to_numpy()

array([-0.0559452 ,  1.28121903,  0.66931447,  1.6909669 , -0.44008155])

In [15]:
# as dict
s["a"]

-0.05594520166328655

In [16]:
s["e"] = 12.0
s

a    -0.055945
b     1.281219
c     0.669314
d     1.690967
e    12.000000
dtype: float64

In [17]:
"e" in s

True

#### Difference between Series and ndarray
A key difference between `Series` and `ndarray` is that operations between `Series` **automatically align the data based on label**. Thus, you can write computations without giving consideration to whether the `Series` involved have the same labels.

In [4]:
s = pd.Series(np.random.randn(5))
s

0    0.999543
1   -0.211492
2    1.685679
3    0.723839
4    0.556657
dtype: float64

In [5]:
s[1:] + s[:-1]

0         NaN
1   -0.422984
2    3.371359
3    1.447678
4         NaN
dtype: float64

#### Label Alignment
Label alignment means that an operation between two `Series` pairs up the
elements that share the same label, regardless of their position. If a label is
missing from either operand, the result for that label is `NaN`. That is why
labels 0 and 4 are `NaN` above: `s[1:]` has no label 0 and `s[:-1]` has no label 4.
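
As a minimal sketch of the alignment rule, the example below adds two small `Series` that share only some labels and contrasts the result with plain NumPy addition. The names `left`, `right`, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two Series that share only the labels "b" and "c".
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Label-aligned addition: the result covers the union of the labels,
# and labels present on only one side become NaN.
aligned = left + right

# Series.add with fill_value treats a missing label as 0.0 instead of NaN.
filled = left.add(right, fill_value=0.0)

# Plain NumPy arrays ignore labels and are combined purely by position.
positional = left.to_numpy() + right.to_numpy()

print(aligned)
print(filled)
print(positional)
```

If the `NaN` values are not wanted, `fill_value` or a later `dropna()` are two possible ways to handle them; which one fits depends on the analysis.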

#### Name attribute
`Series` also has a `name` attribute, which in many cases is assigned
automatically. For example, when you select a single column of a DataFrame, the
resulting `Series` takes the column label as its name. You can rename a `Series`
with `pandas.Series.rename()`.

In [19]:
s = pd.Series(np.random.randn(5), name="something")
s

0    0.870639
1    0.949211
2    0.914593
3   -0.490561
4    0.307311
Name: something, dtype: float64

In [20]:
s2 = s.rename("different")
s2

0    0.870639
1    0.949211
2    0.914593
3   -0.490561
4    0.307311
Name: different, dtype: float64
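
The automatic naming mentioned above can be checked with a small sketch. The DataFrame `frame` and its column labels are hypothetical and only serve the illustration.

```python
import pandas as pd

# A small DataFrame with two labeled columns.
frame = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Selecting one column returns a Series named after that column.
col = frame["x"]
print(col.name)

# rename() returns a new Series; the original keeps its name.
renamed = col.rename("x_renamed")
print(renamed.name, col.name)
```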

## DataFrame

`DataFrame` is a data structure that contains:

1. Data organized in two dimensions, rows and columns
2. Labels that correspond to the rows and columns

Note: `index` holds the row labels and `columns` holds the column labels.

You can create a DataFrame from different sources, using either the `DataFrame`
constructor or a reader function such as `pd.read_csv()`:

- Python dictionaries
- Python lists
- Two-dimensional NumPy arrays
- Files

Note: You can convert a Series into a DataFrame using `s.to_frame()`
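
A short sketch of that conversion, assuming a named `Series` called `scores` (a made-up example): the Series name becomes the single column label, and `to_frame(name=...)` can override it.

```python
import pandas as pd

# A named Series with string labels.
scores = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"], name="score")

# to_frame() produces a one-column DataFrame; the column takes the Series name.
as_frame = scores.to_frame()
print(as_frame)

# Passing name= sets a different column label.
relabeled = scores.to_frame(name="points")
print(relabeled.columns)
```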

In [25]:
# from dictionaries
# the keys become the column labels; the scalar 100 is repeated for every row
d = {'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': 100}
pd.DataFrame(d)

   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100


In [27]:
pd.DataFrame(d, index=[100,200,300], columns=['z', 'y', 'x'])

       z  y  x
100  100  2  1
200  100  4  2
300  100  8  3


In [28]:
# from list of dicts
# keys are the column labels
l = [{'x': 1, 'y': 2, 'z': 100},
     {'x': 2, 'y': 4, 'z': 100},
     {'x': 3, 'y': 8, 'z': 100}]

pd.DataFrame(l)

   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100


In [29]:
# from a list of lists
l = [[1, 2, 100],
     [2, 4, 100],
     [3, 7, 100]]

pd.DataFrame(l)

   0  1    2
0  1  2  100
1  2  4  100
2  3  7  100


In [30]:
# from numpy arrays
arr = np.random.randn(3,3)

df = pd.DataFrame(arr, columns=['x', 'y','z'])
df

          x         y         z
0 -0.089461 -1.269039  1.227437
1 -1.138274 -0.275705  0.109113
2  0.191005 -0.432486 -0.772203


In [34]:
# from a file
pd.read_csv('california_housing_test.csv')

# Note: you can use the index_col parameter (e.g. index_col=0) to indicate that
# the first column is the index

      longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0       -122.05     37.37                27.0       3885.0           661.0      1537.0       606.0         6.6085            344700.0
1       -118.30     34.26                43.0       1510.0           310.0       809.0       277.0         3.5990            176500.0
2       -117.81     33.78                27.0       3589.0           507.0      1484.0       495.0         5.7934            270500.0
3       -118.36     33.82                28.0         67.0            15.0        49.0        11.0         6.1359            330000.0
4       -119.67     36.33                19.0       1241.0           244.0       850.0       237.0         2.9375             81700.0
...         ...       ...                 ...          ...             ...         ...         ...            ...                 ...
2995    -119.86     34.42                23.0       1450.0           642.0      1258.0       607.0         1.1790            225000.0
2996    -118.14     34.06                27.0       5257.0          1082.0      3496.0      1036.0         3.3906            237200.0
2997    -119.70     36.30                10.0        956.0           201.0       693.0       220.0         2.2895             62000.0
2998    -117.12     34.10                40.0         96.0            14.0        46.0        14.0         3.2708            162500.0
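
To illustrate the `index_col` parameter without depending on the housing file, the sketch below reads a tiny in-memory CSV. The `csv_text` content and its `id` column are invented for the example and are not part of the dataset above.

```python
import io

import pandas as pd

# A tiny CSV held in memory; the "id" column is a hypothetical row identifier.
csv_text = "id,longitude,latitude\n10,-122.05,37.37\n11,-118.30,34.26\n"

# Without index_col, pandas builds a default 0..n-1 index and keeps "id" as data.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row labels.
id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(id_index)
```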


#### Sharing data between a NumPy array and a DataFrame with copy=False
The `DataFrame` constructor accepts an optional `copy` parameter. Its default is
`None`, which for a two-dimensional NumPy array behaves like `copy=False`: the
data isn't copied, and the DataFrame uses the array's memory directly. If you
modify the array, your DataFrame changes too. This depends on the pandas
version: with Copy-on-Write enabled (the default from pandas 3.0), the
constructor copies a NumPy array unless you pass `copy=False` explicitly.


Note: `.to_numpy()` goes in the other direction and returns a NumPy array with
the DataFrame's data. It also accepts `copy` and `dtype` parameters.

In [21]:
arr = np.random.randn(4,3)
arr

array([[ 0.16369762, -0.90762996,  0.34460213],
       [-0.43275065,  0.04697371, -0.63478658],
       [ 0.67230189,  1.01114492, -0.22444368],
       [ 0.19148366, -0.63776014, -0.97257583]])

In [22]:
df = pd.DataFrame(arr, columns=["x", "y", "z"], copy=False)  # share memory with arr
df

          x         y         z
0  0.163698 -0.907630  0.344602
1 -0.432751  0.046974 -0.634787
2  0.672302  1.011145 -0.224444
3  0.191484 -0.637760 -0.972576


In [23]:
arr[0,0] = 100
df

            x         y         z
0  100.000000 -0.907630  0.344602
1   -0.432751  0.046974 -0.634787
2    0.672302  1.011145 -0.224444
3    0.191484 -0.637760 -0.972576
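
For contrast, here is a minimal sketch of the opposite case, where the DataFrame should stay independent of the array. It uses `copy=True` in the constructor and the `copy` and `dtype` parameters of `.to_numpy()`; the names `independent`, `snapshot`, and `as_float32` are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.zeros((2, 2))

# copy=True asks pandas to take its own copy of the array data.
independent = pd.DataFrame(arr, columns=["x", "y"], copy=True)
arr[0, 0] = 100.0
print(independent.loc[0, "x"])  # the DataFrame still holds the old value

# to_numpy(copy=True) returns an array that can be modified safely.
snapshot = independent.to_numpy(copy=True)
snapshot[1, 1] = -1.0
print(independent.loc[1, "y"])  # the DataFrame is unaffected

# dtype converts the values while exporting.
as_float32 = independent.to_numpy(dtype="float32")
print(as_float32.dtype)
```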


### Pandas vs Numpy
Note: While both Pandas and NumPy can perform row and column operations, their design and functionality make them more efficient and convenient for specific types of operations. Pandas is particularly well-suited for working with structured data and performing column-based operations, while NumPy excels in numerical computations and row-based operations on multi-dimensional arrays.

#### Each row and column is a Series

You can access a row or a column, and each one is returned as a `Series`.

- `df[<column_name>]` or `df.column_name` (if the column name is a valid Python identifier)
retrieves the column as a `Series`
- `df.loc[<row_label>]` retrieves the row as a `Series`

Note: `df.iloc[<row_position>]` can also be used, but it takes the zero-based
integer position of the row rather than its label.

Explore more in Indexing

In [31]:
df = pd.DataFrame(np.random.rand(3,3), index= [100, 200, 300], columns = ["x", "y", "z"])
df

            x         y         z
100  0.331919  0.082381  0.478104
200  0.040851  0.941652  0.950936
300  0.420020  0.137894  0.853725


In [32]:
# column Series
df.x

100    0.331919
200    0.040851
300    0.420020
Name: x, dtype: float64

In [33]:
# row Series
# note the index labels are now the column names
df.loc[100]

x    0.331919
y    0.082381
z    0.478104
Name: 100, dtype: float64
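
The difference between label-based and position-based row access can be sketched as follows. The DataFrame `frame` and its values are invented for the example.

```python
import pandas as pd

# Row labels 100, 200, 300 sit at positions 0, 1, 2.
frame = pd.DataFrame(
    {"x": [0.1, 0.2, 0.3], "y": [1.0, 2.0, 3.0]},
    index=[100, 200, 300],
)

# .loc looks the row up by its label.
by_label = frame.loc[200]

# .iloc looks the row up by its zero-based position.
by_position = frame.iloc[1]

# Both return the same row as a Series named after the row label.
print(by_label.equals(by_position))
print(by_label.name)
```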

### Different ways to create an index [TODO]

`index = pd.date_range("10/1/1999", periods=1100)`
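
As one possible illustration of a date-based index, the sketch below builds a short daily range and uses it to label a `Series`. The start date is written in ISO format and the length is reduced to five periods, both choices made only for this example.

```python
import numpy as np
import pandas as pd

# Five consecutive days; date_range uses daily frequency by default.
index = pd.date_range("1999-10-01", periods=5)

# Use the dates as the labels of a Series.
ts = pd.Series(np.arange(5), index=index)
print(ts)

# Look a value up by its timestamp label.
print(ts.loc[pd.Timestamp("1999-10-03")])
```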

#### `head()`, `tail()`, `describe()`

In some cases the DataFrame is too large to display in full, so you can use
`head()` or `tail()` to display only the first or last rows.

Additionally, `describe()` returns summary statistics for your data, and
`info()` prints a summary of the DataFrame: index, columns, non-null counts,
dtypes, and memory usage.

In [35]:
df = pd.read_csv("./california_housing_test.csv")
df.head(3)

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -122.05     37.37                27.0       3885.0           661.0      1537.0       606.0         6.6085            344700.0
1    -118.30     34.26                43.0       1510.0           310.0       809.0       277.0         3.5990            176500.0
2    -117.81     33.78                27.0       3589.0           507.0      1484.0       495.0         5.7934            270500.0


In [36]:
df.tail(3)

      longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
2997    -119.70     36.30                10.0        956.0           201.0       693.0       220.0         2.2895             62000.0
2998    -117.12     34.10                40.0         96.0            14.0        46.0        14.0         3.2708            162500.0
2999    -119.63     34.42                42.0       1765.0           263.0       753.0       260.0         8.5608            500001.0


In [37]:
df.describe()

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms   population  households  median_income  median_house_value
count     3000.0    3000.0              3000.0       3000.0          3000.0       3000.0      3000.0         3000.0              3000.0
mean   -119.5892  35.63539           28.845333  2599.578667      529.950667  1402.798667     489.912       3.807272          205846.275
std     1.994936   2.12967           12.555396  2155.593332      415.654368  1030.543012   365.42271       1.854512        113119.68747
min      -124.18     32.56                 1.0          6.0             2.0          5.0         2.0         0.4999             22500.0
25%      -121.81     33.93                18.0       1401.0           291.0        780.0       273.0          2.544            121200.0
50%     -118.485     34.27                29.0       2106.0           437.0       1155.0       409.5        3.48715            177650.0
75%      -118.02     37.69                37.0       3129.0           636.0      1742.75      597.25       4.656475            263975.0
max      -114.49     41.92                52.0      30450.0          5419.0      11935.0      4930.0        15.0001            500001.0


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           3000 non-null   float64
 1   latitude            3000 non-null   float64
 2   housing_median_age  3000 non-null   float64
 3   total_rooms         3000 non-null   float64
 4   total_bedrooms      3000 non-null   float64
 5   population          3000 non-null   float64
 6   households          3000 non-null   float64
 7   median_income       3000 non-null   float64
 8   median_house_value  3000 non-null   float64
dtypes: float64(9)
memory usage: 211.1 KB


## Labels, Types and Sizes

These attributes and methods are less important, but they can be useful in some cases.

In [None]:
data = {
     'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
     'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
              'Manchester', 'Cairo', 'Osaka'],
     'age': [41, 28, 33, 34, 38, 31, 37],
     'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
 }

row_labels = [101, 102, 103, 104, 105, 106, 107]
df = pd.DataFrame(data=data, index=row_labels)

df

       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
103    Jana       Prague   33      81.0
104      Yi     Shanghai   34      80.0
105   Robin   Manchester   38      68.0
106    Amal        Cairo   31      61.0
107    Nori        Osaka   37      84.0


#### Labels as sequences

You can get the labels as sequences using `df.index` or `df.columns`. You can
replace the whole sequence at once, but you cannot change individual labels in
place, because `Index` objects are immutable.

In [None]:
df.index

Index([101, 102, 103, 104, 105, 106, 107], dtype='int64')

In [None]:
df.columns

Index(['name', 'city', 'age', 'py-score'], dtype='object')

In [None]:
df.index = np.arange(10,17)
df.index

Index([10, 11, 12, 13, 14, 15, 16], dtype='int64')

In [None]:
df.columns[0] = "different" # error

TypeError: Index does not support mutable operations
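
Since single labels cannot be assigned in place, one common alternative is `DataFrame.rename()`, which returns a new DataFrame with selected labels changed. The sketch below uses a reduced, made-up version of the table above.

```python
import pandas as pd

frame = pd.DataFrame(
    {"name": ["Xavier", "Ann"], "age": [41, 28]},
    index=[101, 102],
)

# Assigning to one position of an Index raises TypeError.
try:
    frame.columns[0] = "different"
except TypeError as exc:
    print("TypeError:", exc)

# rename() maps old labels to new ones and leaves the original untouched.
renamed = frame.rename(columns={"name": "first_name"}, index={101: 1})
print(renamed.columns)
print(renamed.index)
```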

#### Data Types and Sizes

The following attributes and methods can help you:

- `df.dtypes` returns a Series with the data type of each column
- `df.astype()` changes the data type of one or more columns
- `df.ndim` returns the number of dimensions: 2 for a DataFrame, 1 for a Series
- `df.shape` returns a tuple with the number of elements along each dimension (rows, columns)
- `df.size` returns the total number of elements (rows times columns)

In [None]:
df.dtypes

name         object
city         object
age           int64
py-score    float64
dtype: object

In [None]:
df = df.astype(dtype = {'py-score' : np.float32})
df.dtypes

name         object
city         object
age           int64
py-score    float32
dtype: object

In [None]:
print(df.ndim)
print(df.shape)
print(df.size)

2
(7, 4)
28
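
The relationship between these attributes can be checked on a small example. The sketch below uses a made-up three-row DataFrame and also shows the same attributes on a single column, which is a `Series`.

```python
import pandas as pd

frame = pd.DataFrame({"age": [41, 28, 33], "py-score": [88.0, 79.0, 81.0]})

# shape is (rows, columns) and size is their product.
rows, cols = frame.shape
print(frame.ndim, frame.shape, frame.size)

# A single column is a Series: one dimension, shape (rows,).
age = frame["age"]
print(age.ndim, age.shape, age.size)

# astype() with a dict converts only the listed columns.
converted = frame.astype({"age": "float64"})
print(converted.dtypes)
```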


#### Iterating a DataFrame
