## Data In NumPy
How does `NumPy` address Python's problems? It uses an `ndarray`, which is implemented in `C` and stores its data in a contiguous block of memory rather than as separate Python objects. This lets it apply vectorized operations to the whole array instead of looping over each element in Python.

In [2]:
import numpy as np
array = np.arange(10, dtype='int32')
array += 1
print(array)

[ 1  2  3  4  5  6  7  8  9 10]
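To make the contrast with plain Python concrete, here is a minimal sketch that performs the same increment with a list comprehension and with an `ndarray`, then inspects the array's memory layout. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Plain Python: every element is a separate object and the loop runs in Python.
values = list(range(10))
incremented_list = [v + 1 for v in values]

# NumPy: one vectorized operation over a contiguous buffer of int32 values.
array = np.arange(10, dtype="int32")
incremented_array = array + 1

# Both approaches produce the same numbers.
print(incremented_array.tolist() == incremented_list)

# The array reports a C-contiguous layout; each int32 takes 4 bytes.
print(array.flags["C_CONTIGUOUS"])
print(array.itemsize, array.nbytes)
```

The `itemsize` and `nbytes` attributes show that the array's storage is simply the element size multiplied by the number of elements, with no per-element object overhead.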


In [3]:
n1 = np.array([1], dtype='uint8')
n255 = np.array([255], dtype='uint8')
n1 + n255

array([0], dtype=uint8)

In [4]:
demo = np.array([1,2,3,np.nan])
demo

array([ 1.,  2.,  3., nan])

In [5]:
demo.dtype

dtype('float64')

In [6]:
small_values = [1, 99, 127]
large_values = [2**31, 2**63, 2**100]
missing_values = [None, 1, -45]

In [7]:
import pandas as pd
small_ser = pd.Series(small_values)
small_ser

0      1
1     99
2    127
dtype: int64

By default, pandas stores these values with the `NumPy` datatype `int64`, which uses a 64-bit block of memory for each integer. Because the largest value in `small_values` doesn't exceed 127, we could store these integers in 8-bit blocks instead. The datatype would be `int8`.

In [8]:
small_ser.astype("int8")

0      1
1     99
2    127
dtype: int8
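The saving from the smaller datatype can be checked directly. The sketch below, which rebuilds the series from the same three values, looks up the range that `int8` can hold and compares the bytes used by the values in each series (the index is excluded so that only the data is measured).

```python
import numpy as np
import pandas as pd

small_ser = pd.Series([1, 99, 127], dtype="int64")
small_int8 = small_ser.astype("int8")

# int8 covers -128..127, so every value in the series fits.
info = np.iinfo("int8")
print(info.min, info.max)

# Bytes used by the values alone: 8 bytes per int64, 1 byte per int8.
print(small_ser.memory_usage(index=False))
print(small_int8.memory_usage(index=False))
```

Checking `np.iinfo` before downcasting is one way to confirm that the values fit; as the later cells show, a plain `astype` to a too-small `NumPy` type wraps silently rather than raising.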

In [9]:
import pyarrow as pa
small_ser_pa = pd.Series(small_values, dtype='int8[pyarrow]')
small_ser_pa

0      1
1     99
2    127
dtype: int8[pyarrow]

In [10]:
large_ser = pd.Series(large_values)
large_ser

0                         2147483648
1                9223372036854775808
2    1267650600228229401496703205376
dtype: object

In [12]:
large_ser_pa_dtype = pd.Series(large_values, dtype='int64[pyarrow]')

OverflowError: Python int too large to convert to C long

In [14]:
missing_ser = pd.Series(missing_values)
missing_ser

0     NaN
1     1.0
2   -45.0
dtype: float64

In [15]:
missing_ser_pa = pd.Series(missing_values, dtype='int8[pyarrow]')
missing_ser_pa

0    <NA>
1       1
2     -45
dtype: int8[pyarrow]

In [17]:
medium_values = [2**15+5, 2**31-8, 2**63]
medium_ser = pd.Series(medium_values)
medium_ser

0                  32773
1             2147483640
2    9223372036854775808
dtype: uint64

In [18]:
medium_ser.astype('int8')

0    5
1   -8
2    0
dtype: int8

In [19]:
medium_ser.astype('int8[pyarrow]')

ArrowInvalid: Integer value 32773 not in range: 0 to 127

In [None]:
float_vals = [1.5, 2.7, 127.0]
float_missing = [None, 1.5, -45.0]
float_rain = [1.5, 2.7, 0.0, 'T', 1.5, 0]

In [21]:
pd.Series(float_vals)

0      1.5
1      2.7
2    127.0
dtype: float64

In [22]:
pd.Series(float_missing)

0     NaN
1     1.5
2   -45.0
dtype: float64

In [23]:
pd.Series(float_missing, dtype='float64[pyarrow]')

0    <NA>
1     1.5
2   -45.0
dtype: double[pyarrow]

In [24]:
pd.Series(float_rain)

0    1.5
1    2.7
2    0.0
3      T
4    1.5
5      0
dtype: object

In [25]:
pd.Series(float_rain).replace('T','0.0').astype('float64')

0    1.5
1    2.7
2    0.0
3    0.0
4    1.5
5    0.0
dtype: float64

In [26]:
pd.Series(float_rain).replace('T',0).astype('float64[pyarrow]')

0    1.5
1    2.7
2    0.0
3    0.0
4    1.5
5    0.0
dtype: double[pyarrow]

## Categorical Data
Categorical data is string data that has *low cardinality*, meaning it has few distinct values relative to the number of rows.
Pandas stores the distinct categories in a separate array and then stores integers that refer to those values.

In [27]:
states = ['CA', 'NY', 'TX']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


In [28]:
pd.Series(states, dtype='category')

0    CA
1    NY
2    TX
dtype: category
Categories (3, object): ['CA', 'NY', 'TX']
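The two-part storage described above can be inspected through the `.cat` accessor. This is a small sketch using a hypothetical series with repeated states (not from the original notebook), so the integer codes visibly repeat while each category is stored once.

```python
import pandas as pd

# Hypothetical data with repeats, to show codes pointing back at categories.
state_ser = pd.Series(["CA", "NY", "TX", "CA", "CA", "NY"], dtype="category")

# The distinct values are stored once...
print(list(state_ser.cat.categories))

# ...and each row holds a small integer code referring to one of them.
print(state_ser.cat.codes.tolist())
print(state_ser.cat.codes.dtype)
```

With only three categories the codes fit in `int8`, which is where the memory saving over storing a Python string per row comes from.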

In [29]:
pd.Series(months, dtype='category')

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, object): ['Apr', 'Aug', 'Dec', 'Feb', ..., 'May', 'Nov', 'Oct', 'Sep']

If we sort months, Pandas alphabetizes them:

In [30]:
pd.Series(months, dtype='category').sort_values()

3     Apr
7     Aug
11    Dec
1     Feb
0     Jan
6     Jul
5     Jun
2     Mar
4     May
10    Nov
9     Oct
8     Sep
dtype: category
Categories (12, object): ['Apr', 'Aug', 'Dec', 'Feb', ..., 'May', 'Nov', 'Oct', 'Sep']

To sort the months in calendar order, create an ordered categorical type and pass it to the `dtype` parameter:

In [31]:
month_cat = pd.CategoricalDtype(categories=months, ordered=True)
pd.Series(months,dtype=month_cat).sort_values()

0     Jan
1     Feb
2     Mar
3     Apr
4     May
5     Jun
6     Jul
7     Aug
8     Sep
9     Oct
10    Nov
11    Dec
dtype: category
Categories (12, object): ['Jan' < 'Feb' < 'Mar' < 'Apr' ... 'Sep' < 'Oct' < 'Nov' < 'Dec']
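An ordered categorical type affects more than sorting. As a brief sketch, the example below reuses the same `month_cat` definition on a short, hypothetical series and shows that comparisons and `min`/`max` also follow calendar order rather than alphabetical order.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_cat = pd.CategoricalDtype(categories=months, ordered=True)

# Hypothetical sample of months in arbitrary order.
month_ser = pd.Series(["Mar", "Jan", "Dec", "Feb"], dtype=month_cat)

# min and max respect the declared category order.
print(month_ser.min(), month_ser.max())

# Comparisons against a category use the same order.
print((month_ser > "Feb").tolist())

# Sorting follows the calendar, not the alphabet.
print(month_ser.sort_values().tolist())
```

Without `ordered=True`, the comparison with `>` would raise a `TypeError`, because pandas has no defined order for unordered categories.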