In [12]:
import pandas as pd
import numpy as np

# PANDAS

## The Series object

A Series is a one-dimensional, array-like object that holds a sequence of values of a single type, together with an associated array of data labels called its index.



In [18]:
s1 = pd.Series(['Alice', 'Jack', 'Molly'])
print(s1)
print(s1.array)
print(s1.index)

0    Alice
1     Jack
2    Molly
dtype: object
<NumpyExtensionArray>
['Alice', 'Jack', 'Molly']
Length: 3, dtype: object
RangeIndex(start=0, stop=3, step=1)


^

So its array is a thin pandas wrapper (NumpyExtensionArray) around a NumPy array; for some data types pandas uses one of its own special extension array types instead.

Its index is determined by what we pass as an index (if we pass anything).

Here we didn't pass anything so the index is a range (RangeIndex).
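As a minimal sketch of the two points above, the snippet below pulls out the underlying NumPy array with to_numpy() and compares the default index with an explicit one. The variable names and the labels "a", "b", "c" are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Alice", "Jack", "Molly"])

# .to_numpy() hands back the values as a plain NumPy array
values = s.to_numpy()
print(type(values))                        # <class 'numpy.ndarray'>

# No index was passed, so pandas falls back to a RangeIndex
print(isinstance(s.index, pd.RangeIndex))  # True

# Passing index= replaces the default labels (labels here are illustrative)
labelled = pd.Series(["Alice", "Jack", "Molly"], index=["a", "b", "c"])
print(list(labelled.index))                # ['a', 'b', 'c']
```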

In [5]:
s2 = pd.Series([0, 1, 2, 3])
s2

0    0
1    1
2    2
3    3
dtype: int64

^

pandas automatically infers the type based on the data we provide.
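A small sketch of that inference, assuming plain Python numbers as input: a single float in the list is enough to make the whole Series float64, and the dtype argument overrides the inferred type. The variable names are illustrative.

```python
import pandas as pd

ints = pd.Series([0, 1, 2, 3])
floats = pd.Series([0.5, 1, 2])                       # one float promotes every value
explicit = pd.Series([0, 1, 2, 3], dtype="float64")   # override the inference

print(ints.dtype)      # int64
print(floats.dtype)    # float64
print(explicit.dtype)  # float64
```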

What about how pandas handles missing data?

In [6]:
s3 = pd.Series(['Alice', 'Jack', None])
s3

0    Alice
1     Jack
2     None
dtype: object

For strings, not much changes: the dtype stays object and pandas stores None itself as the missing value.

What about for numbers?

In [8]:
s4 = pd.Series([None, 1, 2, 3])
s4

0    NaN
1    1.0
2    2.0
3    3.0
dtype: float64

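For numbers, pandas (via NumPy) stores the missing value as NaN, which is different from None.

Underneath, NaN is a floating-point number, which is why the integers above were converted to float64. NaN does not compare equal to anything, not even itself, so we check for it with np.isnan instead of ==: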

In [17]:
print(np.nan == None)
print(np.nan == np.nan)
print(np.isnan(np.nan))

False
False
True
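Because equality checks cannot find NaN, it is usually easier to let pandas do the test. The sketch below rebuilds the two Series from above under new, illustrative names and uses isna() and dropna(), which treat None and NaN alike.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Alice", "Jack", None])
nums = pd.Series([None, 1, 2, 3])

print(type(np.nan))             # <class 'float'>

# isna() flags both None and NaN as missing
print(names.isna().tolist())    # [False, False, True]
print(nums.isna().tolist())     # [True, False, False, False]

# dropna() returns a new Series without the missing entries
print(nums.dropna().tolist())   # [1.0, 2.0, 3.0]
```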


In [20]:
scores = {
    "Alice": 3.45,
    "Max": 3.6,
    "Jimmy": 3
}

s5 = pd.Series(scores)
print(s5)
print(s5.array)
print(s5.index)

Alice    3.45
Max      3.60
Jimmy    3.00
dtype: float64
<NumpyExtensionArray>
[3.45, 3.6, 3.0]
Length: 3, dtype: float64
Index(['Alice', 'Max', 'Jimmy'], dtype='object')


In [22]:
# Alternative creation

s6 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
s6

d    4
b    7
a   -5
c    3
dtype: int64

In [23]:
# We can use the labels to access the data
print(s6["a"])
print(s6[["a", "b", "c"]])

-5
a   -5
b    7
c    3
dtype: int64


In [26]:
# We can use Boolean indexing, just as with NumPy arrays
print(s6[s6 > 0])

# We can do math operations
print(s6**2)

print(np.exp(s6))

d    4
b    7
c    3
dtype: int64
d    16
b    49
a    25
c     9
dtype: int64
d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64


In [27]:
print("b" in s6)
print("e" in s6)

True
False
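One thing worth noting about the check above: in tests the index labels, not the values. A short sketch, rebuilding s6 under the name s, shows the difference and two ways to search the values instead.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

print("b" in s)                 # True: "b" is an index label
print(7 in s)                   # False: 7 is a value, not a label

# To search the values, look in the underlying array or use isin()
print(7 in s.to_numpy())        # True
print(s.isin([7, 3]).tolist())  # [False, True, False, True]
```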


In [29]:
s6.isna() # check for missing data

d    False
b    False
a    False
c    False
dtype: bool

In [30]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

s7 = pd.Series(sdata)
s7

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [31]:
states = ["California", "Ohio", "Oregon", "Texas"]
# We can do this now: provide the index and the resulting Series will keep the order in our index
s8 = pd.Series(sdata, index=states)
s8

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [32]:
# And now we can do something cool:
s7 + s8

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
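What happened here is that pandas aligned the two Series by index label before adding, so any label missing from either side gives NaN. If that is not what we want, one option is the add() method with fill_value, sketched below on the same data. A label that is missing on both sides still gives NaN.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
s7 = pd.Series(sdata)
s8 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Treat a value that is missing on one side as 0 before adding
total = s7.add(s8, fill_value=0)

print(total["Ohio"])        # 70000.0
print(total["Utah"])        # 5000.0 (only present in s7)
print(total["California"])  # nan (missing in both s7 and s8)
```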

In [33]:
# We can alter the "column" name as well as the index name:
s8.name = "population"
s8.index.name = "state"
s8

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

### Indexing

An important note: **iloc** and **loc** are not **methods**; they are attributes, which is why we use square brackets with them.

They are called **indexing operators**.

In [41]:
# Using iloc (numerical indexing)
print(s7.iloc[0])
# Using loc (text label)
print(s7.loc["Utah"])

35000
5000
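The difference between the two matters most when the labels are themselves integers, or when slicing. The sketch below uses a made-up Series s with reversed integer labels, plus a copy of the state data under the name t; note that loc slices include the end label while iloc slices exclude the end position.

```python
import pandas as pd

# Illustrative Series whose labels are integers in reverse order
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

print(s.loc[0])    # 40: the entry whose label is 0
print(s.iloc[0])   # 10: the entry at position 0

t = pd.Series([35000, 71000, 16000, 5000],
              index=["Ohio", "Texas", "Oregon", "Utah"])

# loc slices by label and includes the end label
print(t.loc["Ohio":"Oregon"].tolist())   # [35000, 71000, 16000]

# iloc slices by position and excludes the end position
print(t.iloc[0:2].tolist())              # [35000, 71000]
```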


In [44]:
print(s5)
print(np.sum(s5)/len(s5))

Alice    3.45
Max      3.60
Jimmy    3.00
dtype: float64
3.35


In [45]:
numbers = pd.Series(np.random.randint(0,2000,10000))
len(numbers)

10000

In [48]:
%%timeit -n 100
# Let's timeit to see just how much faster vectorized ops are
total = 0
for number in numbers:
    total+=number

total/len(numbers)

870 μs ± 125 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [49]:
%%timeit -n 100
total = np.sum(numbers)
total/len(numbers)

55.7 μs ± 5.78 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
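The two timed cells compute the same average, which can be checked outside of %%timeit. The sketch below assumes a seeded generator (np.random.default_rng(0), my choice rather than the notebook's) so the run is repeatable, and also uses the built-in mean() method as a shorter vectorized form.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
numbers = pd.Series(rng.integers(0, 2000, 10000))

# Element-by-element Python loop
loop_total = 0
for number in numbers:
    loop_total += number
loop_mean = loop_total / len(numbers)

# Vectorized equivalents
numpy_mean = np.sum(numbers) / len(numbers)
pandas_mean = numbers.mean()

print(np.isclose(loop_mean, numpy_mean))   # True
print(np.isclose(loop_mean, pandas_mean))  # True
```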


## The DataFrame object

If a Series is a one-dimensional structure, we can look at a DataFrame as a two-dimensional structure.

It can be thought of as a dictionary of Series all sharing the same index.

In [34]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df1 = pd.DataFrame(data)
df1

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
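The "dictionary of Series sharing the same index" view can be made literal. In the sketch below the two Series, their names and their numbers are invented for illustration; pandas aligns them on the union of their labels, fills the gaps with NaN, and each column comes back out as a Series.

```python
import pandas as pd

# Made-up values, only to illustrate alignment
pop = pd.Series({"Ohio": 1.0, "Nevada": 2.0})
area = pd.Series({"Nevada": 3.0, "Texas": 4.0})

df = pd.DataFrame({"pop": pop, "area": area})
print(df)

# Each column is a Series that shares the DataFrame's index
print(type(df["pop"]))                    # <class 'pandas.core.series.Series'>
print(df.index.equals(df["area"].index))  # True

# Labels missing from one of the inputs are filled with NaN
print(pd.isna(df.loc["Texas", "pop"]))    # True
```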


In [50]:
record1 = pd.Series({
    "name": "Alice",
    "class": "Physics",
    "score": "85"
})
record2 = pd.Series({
    "name": "Jack",
    "class": "Chemistry",
    "score": "70"
})
record3 = pd.Series({
    "name": "Helen",
    "class": "Biology",
    "score": "90"
})

df2 = pd.DataFrame([record1, record2, record3], index=["school1", "school2", "school3"])
df2

          name      class score
school1  Alice    Physics    85
school2   Jack  Chemistry    70
school3  Helen    Biology    90
