# Pandas DataFrames

## Pandas Series

In [1]:
import numpy as np
import pandas as pd

Define a new series by passing a collection of homogeneous data, such as an ndarray or a list, along with a list of associated index labels to pd.Series():

In [2]:
my_series = pd.Series(data=[2, 3, 5, 4],            # Data
                      index=["a", "b", "c", "d"])   # Index labels

my_series

a    2
b    3
c    5
d    4
dtype: int64

You can also create a series from a dictionary, in which case the dictionary keys act as the labels and the values act as the data:

In [3]:
my_dict = {"x":2, "a":5, "b": 4, "c":8}

my_series2 = pd.Series(my_dict)

my_series2

x    2
a    5
b    4
c    8
dtype: int64
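As a small sketch of how the dictionary keys are matched to labels, pd.Series() also accepts an index argument together with a dictionary. Only the listed keys are pulled out, in the order given, and a label with no matching key becomes NaN. The label "z" and the name picked are made up for this illustration.

```python
import pandas as pd

my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}

# Keep only the keys listed in index, in that order.
# "z" is a hypothetical label that is not a key in my_dict, so it becomes NaN.
picked = pd.Series(my_dict, index=["a", "b", "z"])

print(picked)
```

Because NaN is a floating-point value, the resulting series is stored as floats rather than integers.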

Similar to a dictionary, you can access items in a series by their labels:

In [4]:
my_series["a"]

2

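Positional (integer) indexing also works. Use .iloc for it, because passing a bare integer in square brackets to a series with string labels is deprecated in recent pandas versions:

In [5]:
my_series.iloc[0]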

2

If you take a slice of a series, you get both the values and the labels in the slice:

In [6]:
my_series[1:3]

b    3
c    5
dtype: int64
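The slice above is positional. A minimal sketch of the two explicit forms, using the same data as my_series, shows one difference worth remembering: a positional slice excludes its stop position, while a label slice includes its stop label. The names positional and labelled are just for illustration.

```python
import pandas as pd

my_series = pd.Series(data=[2, 3, 5, 4], index=["a", "b", "c", "d"])

positional = my_series.iloc[1:3]    # positions 1 and 2; stop position 3 is excluded
labelled = my_series.loc["b":"c"]   # labels "b" through "c"; stop label "c" is included

# Both slices select the same two items here
print(positional.equals(labelled))
```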

As mentioned earlier, operations performed on two series align by labels:

In [8]:
my_series + my_series

a     4
b     6
c    10
d     8
dtype: int64

If you perform an operation on two series that have different labels, the unmatched labels get a value of NaN (not a number), and the result is converted to float64 so it can hold the NaN:

In [9]:
my_series + my_series2

a     7.0
b     7.0
c    13.0
d     NaN
x     NaN
dtype: float64
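If NaN is not what you want for unmatched labels, one option is the method form of the operator. The sketch below uses Series.add() with fill_value=0, which treats a label missing from one side as 0 before adding. Whether 0 is a sensible default depends on your data, so treat that choice as an assumption.

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
my_series2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

# A label missing from one series is treated as 0 instead of producing NaN
total = my_series.add(my_series2, fill_value=0)

print(total)
```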

Other than labelling, series behave much like numpy's ndarrays. A series is even a valid argument to many of the numpy array functions we covered last time:

In [10]:
np.mean(my_series)

3.5
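As a further sketch along the same lines, element-wise numpy functions such as np.sqrt return a new series that keeps the original labels, and most reductions are also available as series methods. The names roots and average are only for illustration.

```python
import numpy as np
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])

roots = np.sqrt(my_series)    # element-wise; the result is a Series with labels a to d
average = my_series.mean()    # method form of np.mean(my_series)

print(roots)
print(average)
```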

## DataFrame Creation and Indexing

In [11]:
# Create a dictionary with some different data types as values

my_dict = {
    "name": ["Joe", "Bob", "Kwadwo"],
    "age": np.array([10, 15, 20]),
    "weight": (75, 123, 239),
    "height": pd.Series([4.5, 5, 6.1],
                       index=["Joe", "Bob", "Kwadwo"]),
    "siblings": 1,
    "gender": "M"
}

df = pd.DataFrame(my_dict)

df

          name  age  weight  height  siblings  gender
Joe        Joe   10      75     4.5         1       M
Bob        Bob   15     123     5.0         1       M
Kwadwo  Kwadwo   20     239     6.1         1       M


In [12]:
my_dict2 = {
    "name": ["Joe", "Bob", "Kwadwo"],
    "age": np.array([10, 15, 20]),
    "weight": (75, 123, 239),
    "height": [4.5, 5, 6.1],
    "siblings": 1,
    "gender": "M"
}

df2 = pd.DataFrame(my_dict2)

df2

     name  age  weight  height  siblings  gender
0     Joe   10      75     4.5         1       M
1     Bob   15     123     5.0         1       M
2  Kwadwo   20     239     6.1         1       M


You can provide custom row labels when creating a DataFrame by adding the index argument:

In [13]:
df2 = pd.DataFrame(my_dict2, index=my_dict2["name"])

df2

          name  age  weight  height  siblings  gender
Joe        Joe   10      75     4.5         1       M
Bob        Bob   15     123     5.0         1       M
Kwadwo  Kwadwo   20     239     6.1         1       M
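If the DataFrame already exists, a hedged alternative is to promote one of its columns to row labels with set_index(). The sketch below uses a small stand-in frame named people (not one of the frames above) and passes drop=False so that "name" stays available as a regular column, mirroring the output above.

```python
import pandas as pd

# Stand-in frame for illustration, with the default 0..2 row labels
people = pd.DataFrame({"name": ["Joe", "Bob", "Kwadwo"], "age": [10, 15, 20]})

# Use the "name" column as row labels; drop=False keeps it as a column as well
labelled = people.set_index("name", drop=False)

print(labelled.loc["Bob", "age"])
```

set_index() returns a new DataFrame and leaves people unchanged.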


A DataFrame behaves like a dictionary of Series objects that all have the same length and share the same index. This means we can get, add, and delete columns in a DataFrame the same way we would with a dictionary:

In [15]:
# Get a column by name
df2["weight"]

Joe        75
Bob       123
Kwadwo    239
Name: weight, dtype: int64

Alternatively, you can get a column by label using "dot" notation

In [17]:
df2.weight

Joe        75
Bob       123
Kwadwo    239
Name: weight, dtype: int64
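Dot notation is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. The sketch below uses a made-up frame, people, with two deliberately awkward column names to show where bracket notation is the safer choice.

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a DataFrame method,
# the other contains a space
people = pd.DataFrame({"count": [1, 2, 3], "body weight": [75, 123, 239]})

clash = people.count             # the DataFrame.count method, not the "count" column
column = people["count"]         # bracket notation always returns the column
spaced = people["body weight"]   # names with spaces are only reachable with brackets

print(callable(clash))
print(list(column))
```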

In [21]:
# Delete a column

del df2["weight"]

In [22]:
df2

          name  age  height  siblings  gender
Joe        Joe   10     4.5         1       M
Bob        Bob   15     5.0         1       M
Kwadwo  Kwadwo   20     6.1         1       M
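del modifies the DataFrame in place. If you would rather keep the original intact, a minimal sketch using drop(), which returns a new DataFrame, looks like this. The frame below is a cut-down stand-in for df2, and trimmed is an illustrative name.

```python
import pandas as pd

# Cut-down stand-in for df2
df2 = pd.DataFrame(
    {"name": ["Joe", "Bob", "Kwadwo"], "age": [10, 15, 20], "weight": [75, 123, 239]},
    index=["Joe", "Bob", "Kwadwo"],
)

trimmed = df2.drop(columns=["weight"])   # new DataFrame without "weight"

print(list(trimmed.columns))
print(list(df2.columns))                 # the original still has "weight"
```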


In [23]:
# Add a new column
df2["IQ"] = [130, 105, 115]

In [24]:
df2

          name  age  height  siblings  gender   IQ
Joe        Joe   10     4.5         1       M  130
Bob        Bob   15     5.0         1       M  105
Kwadwo  Kwadwo   20     6.1         1       M  115


Assigning a single value to a new DataFrame column populates it across all the rows:

In [25]:
df2["Married"] = False

In [26]:
df2

          name  age  height  siblings  gender   IQ  Married
Joe        Joe   10     4.5         1       M  130    False
Bob        Bob   15     5.0         1       M  105    False
Kwadwo  Kwadwo   20     6.1         1       M  115    False


When inserting a series into a DataFrame, rows are matched by index label. Rows with no matching label are filled with NaN:

In [27]:
df["College"] = pd.Series(["Harvard"], index=["Kwadwo"])

In [28]:
df

          name  age  weight  height  siblings  gender  College
Joe        Joe   10      75     4.5         1       M      NaN
Bob        Bob   15     123     5.0         1       M      NaN
Kwadwo  Kwadwo   20     239     6.1         1       M  Harvard
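A short sketch of working with the result: isna() reports which rows were left unmatched, and fillna() replaces the missing entries. The frame here is a one-column stand-in for df, and both the column name College_filled and the placeholder string "None" are assumptions made for the example.

```python
import pandas as pd

# One-column stand-in for df
df = pd.DataFrame({"age": [10, 15, 20]}, index=["Joe", "Bob", "Kwadwo"])

# Aligned by label: only Kwadwo matches, so Joe and Bob get NaN
df["College"] = pd.Series(["Harvard"], index=["Kwadwo"])

missing = df["College"].isna()                        # True where no label matched
df["College_filled"] = df["College"].fillna("None")   # "None" is an arbitrary placeholder

print(df)
```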


You can select both rows and columns by label with df.loc[row, column]:

In [29]:
df2.loc["Joe"]  # select row "Joe"

name          Joe
age            10
height        4.5
siblings        1
gender          M
IQ            130
Married     False
Name: Joe, dtype: object

In [30]:
df2.loc["Joe", "IQ"] # select row Joe and column IQ

130

In [35]:
df2.loc["Joe": "Bob", "siblings":"IQ"]  # Slice by label

     siblings  gender   IQ
Joe         1       M  130
Bob         1       M  105


Select rows or columns by numeric index with df.iloc[row, column]:

In [36]:
df2.iloc[0]       # Get row 0

name          Joe
age            10
height        4.5
siblings        1
gender          M
IQ            130
Married     False
Name: Joe, dtype: object

In [37]:
df2.iloc[0, 5] # Get row 0, column 5

130

In [39]:
df.iloc[0:2, 4:6]  # Slice df (not df2): rows 0-1, columns 4-5

     siblings  gender
Joe         1       M
Bob         1       M
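To connect the two accessors, the sketch below translates labels into positions with Index.get_loc() and checks that a label slice and the matching positional slice select the same block. Note that .loc slices include both endpoints, while .iloc slices exclude the stop position. The frame is a cut-down stand-in for df2.

```python
import pandas as pd

# Cut-down stand-in for df2
df2 = pd.DataFrame(
    {"name": ["Joe", "Bob", "Kwadwo"], "age": [10, 15, 20], "IQ": [130, 105, 115]},
    index=["Joe", "Bob", "Kwadwo"],
)

row_pos = df2.index.get_loc("Joe")     # position of the row label "Joe"
col_pos = df2.columns.get_loc("IQ")    # position of the column label "IQ"

by_label = df2.loc["Joe":"Bob", "age":"IQ"]   # both endpoints are included
by_position = df2.iloc[0:2, 1:3]              # stop positions are excluded

print(df2.iloc[row_pos, col_pos])
print(by_label.equals(by_position))
```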


You can select rows by passing in a sequence of boolean (True/False) values. Rows where the corresponding boolean is True are returned:

In [40]:
boolean_index = [False, True, True]

df2[boolean_index]

          name  age  height  siblings  gender   IQ  Married
Bob        Bob   15     5.0         1       M  105    False
Kwadwo  Kwadwo   20     6.1         1       M  115    False


This sort of logical indexing is useful for subsetting data when combined with logical operations. For example, say we wanted the subset of our DataFrame containing everyone who is over 12 years old. We can do it with boolean indexing:

In [43]:
# Create a boolean series with a logical comparison
boolean_index = df2["age"] > 12

# Use the index to get the rows where age > 12
df2[boolean_index]

          name  age  height  siblings  gender   IQ  Married
Bob        Bob   15     5.0         1       M  105    False
Kwadwo  Kwadwo   20     6.1         1       M  115    False


You can do this sort of indexing all in one operation, without assigning the boolean sequence to a variable:

In [45]:
df2[ df2["age"] > 12]

          name  age  height  siblings  gender   IQ  Married
Bob        Bob   15     5.0         1       M  105    False
Kwadwo  Kwadwo   20     6.1         1       M  115    False
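Several conditions can be combined in the same way. The sketch below, on a cut-down stand-in for df2, uses & for element-wise AND and | for element-wise OR. Each comparison needs its own parentheses because these operators bind more tightly than > and <. The thresholds and result names are arbitrary choices for the example.

```python
import pandas as pd

# Cut-down stand-in for df2
df2 = pd.DataFrame(
    {"name": ["Joe", "Bob", "Kwadwo"], "age": [10, 15, 20], "IQ": [130, 105, 115]},
    index=["Joe", "Bob", "Kwadwo"],
)

# Rows where both conditions hold
older_high_iq = df2[(df2["age"] > 12) & (df2["IQ"] > 110)]

# Rows where at least one condition holds
young_or_high_iq = df2[(df2["age"] < 12) | (df2["IQ"] > 110)]

# .loc accepts a boolean mask for the rows plus a list of column labels
names_over_12 = df2.loc[df2["age"] > 12, ["name", "age"]]

print(older_high_iq)
print(young_or_high_iq)
print(names_over_12)
```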
