In [15]:
import numpy as np
import pandas as pd

## 2. Intro to Pandas

### 2.1 Pandas objects

<b>We'll look at the two main objects of the pandas library: Series and DataFrame.</b>

#### 2.1.1 Series
A Series object is a one-dimensional array of values, each paired with an index label. Think of it as an indexed 'series' of values.
<br>Let's look at some examples:

In [69]:
# List -> Series
numbers = pd.Series([1, 2, 3.4, 5.67, 8, 0.9])

names = pd.Series(['Alane', 'Ayanna', 'Tyisha', 'Jarvis', 'Tabetha', 'Geoffrey', 'Ken'])

print(numbers, '\n')
print(names)

0    1.00
1    2.00
2    3.40
3    5.67
4    8.00
5    0.90
dtype: float64 

0       Alane
1      Ayanna
2      Tyisha
3      Jarvis
4     Tabetha
5    Geoffrey
6         Ken
dtype: object


As you can see, a default integer index (starting at 0) is added to the list of values. Let's add a <b>custom index</b>.

In [38]:
custom_index = 'abcdef'

# Please note how we use the attribute 'values' for a series object
numbers2 = pd.Series(numbers.values, index=list(custom_index))

numbers2

a    1.00
b    2.00
c    3.40
d    5.67
e    8.00
f    0.90
dtype: float64

Let's look at the values in the series objects we've created, using indexes.

In [47]:
print('The second value in series numbers is: ', numbers[1], '\n')

print('The second value in series numbers2 is: ', numbers2['b'])

The second value in series numbers is:  2.0 

The second value in series numbers2 is:  2.0


We can also select a range of values by slicing on the index:

In [61]:
print('The first three values in series numbers2 are:')

# numbers2['a':'c'], numbers2[:3], numbers2[0:3] and numbers2[:-3] all select the same rows.
# Label slices include the end label; positional slices exclude the end position.
print(numbers2[:'c'])

The first three values in series numbers2 are:
a    1.0
b    2.0
c    3.4
dtype: float64
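
The difference between label slices and positional slices is easy to trip over, so here is a minimal sketch of it. It rebuilds the same values as `numbers2` under the assumed name `s` so that it runs on its own, and uses the explicit `loc` and `iloc` indexers to make the intent unambiguous.

```python
import pandas as pd

# Same values and labels as numbers2, rebuilt so the snippet is self-contained
s = pd.Series([1, 2, 3.4, 5.67, 8, 0.9], index=list('abcdef'))

by_label = s.loc['a':'c']     # label slice: the end label 'c' is included
by_position = s.iloc[0:3]     # positional slice: position 3 is excluded

print(by_label.equals(by_position))   # True: both select a, b, c
print(list(s.iloc[1:3].index))        # ['b', 'c']: only two values
```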


Let's look at another way to create a Series object: from a <b>dictionary</b>. The dictionary keys become the index and the dictionary values become the data.

In [76]:
locations_dict = {0:'California', 1:'New York', 2:'Virginia', 3:'Michigan', 4:'Texas', 5:'Nevada', 6:'Illinois'}

locations = pd.Series(locations_dict)

locations

0    California
1      New York
2      Virginia
3      Michigan
4         Texas
5        Nevada
6      Illinois
dtype: object
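
The keys do not have to be integers. As a small sketch with made-up keys and values (`scores` and `subset` are illustrative names only), the following shows string keys becoming the index, and how passing `index=` alongside a dictionary selects and orders the keys, filling in `NaN` for any label missing from the dictionary.

```python
import pandas as pd

# Made-up data: the dictionary keys become the index labels
scores = pd.Series({'x': 10, 'y': 20, 'z': 30})
print(list(scores.index))   # ['x', 'y', 'z']

# index= picks keys in the given order; 'w' is not in the dictionary, so it becomes NaN
subset = pd.Series({'x': 10, 'y': 20, 'z': 30}, index=['z', 'x', 'w'])
print(subset)
```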

#### 2.1.2 Data Frames
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous table of data with labeled axes (rows and columns). It can be thought of as a dictionary of Series objects that share one index, or as a two-dimensional NumPy array with row and column labels attached.
<br>Let's look at some examples:

In [131]:
# one of the most common ways to create a data frame
age = {0:5, 1:21, 2:12, 3:10, 4:30, 5:13, 6:70}

data1 = pd.DataFrame({'Name': names, 'Age': age, 'Location': locations})

data1

       Name  Age    Location
0     Alane    5  California
1    Ayanna   21    New York
2    Tyisha   12    Virginia
3    Jarvis   10    Michigan
4   Tabetha   30       Texas
5  Geoffrey   13      Nevada
6       Ken   70    Illinois
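
The columns above line up because all three inputs share the labels 0 to 6. As a rough sketch of what happens when they do not (the Series `a` and `b` are made up for illustration), pandas aligns the columns on the union of the index labels and fills the gaps with `NaN`:

```python
import pandas as pd

# Two made-up Series whose indexes only partly overlap
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

df = pd.DataFrame({'a': a, 'b': b})
print(df)   # row 'x' has NaN in column 'b' because b has no label 'x'
```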


Let's play with the index a little by turning the 'Name' column into the row index, while introducing another way to create a data frame with pandas: from an existing data frame.

In [132]:
data2 = pd.DataFrame(data=data1).set_index('Name', drop=True)

data2

          Age    Location
Name
Alane       5  California
Ayanna     21    New York
Tyisha     12    Virginia
Jarvis     10    Michigan
Tabetha    30       Texas
Geoffrey   13      Nevada
Ken        70    Illinois


Let's create a data frame using <b>numpy arrays</b>.

In [133]:
data3 = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['Even', 'Odd'], index=list(custom_index))

data3

   Even  Odd
a     0    1
b     2    3
c     4    5
d     6    7
e     8    9
f    10   11


### 2.2 Selection and Indexing of Data in Pandas

Let's look at a couple of ways we access the columns of a data frame in pandas.
<br><b>Note</b>: For this section, we'll use the '<b>data2</b>' data frame created earlier.

In [134]:
data2['Age']

Name
Alane        5
Ayanna      21
Tyisha      12
Jarvis      10
Tabetha     30
Geoffrey    13
Ken         70
Name: Age, dtype: int64

In [135]:
data2.Age

Name
Alane        5
Ayanna      21
Tyisha      12
Jarvis      10
Tabetha     30
Geoffrey    13
Ken         70
Name: Age, dtype: int64

In [136]:
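# equals() compares the contents; `is` only checks object identity,
# which can depend on the pandas version and its internal caching
data2.Age.equals(data2['Age'])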

True

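Both forms return the same column. Bracket notation works for any column name, while attribute access requires the name to be a valid Python identifier that does not clash with an existing DataFrame attribute or method.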
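
To illustrate where attribute access falls short, here is a small sketch using a made-up frame (`df`, with hypothetical columns 'Home State' and 'count'). A name containing a space cannot be written as an attribute, and a column called 'count' is shadowed by the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical frame with two awkward column names
df = pd.DataFrame({
    'Age': [5, 21],
    'Home State': ['California', 'New York'],   # contains a space
    'count': [1, 2],                            # same name as a DataFrame method
})

print(df['Home State'].tolist())   # brackets work for any column name
print(callable(df.count))          # True: df.count is the method, not the column
print(df['count'].tolist())        # brackets still reach the column
```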

Now, we'll use this to operate on our data. We'll see how we can create a new column and enter values in that column by operating on an existing column.
<br><br>Let's say that the 'Age' information in the dataset is 10 years old and we need to add a new column with the adjusted values. Here is how we can do that:

In [137]:
data2['Age_current'] = data2['Age'] + 10

data2

          Age    Location  Age_current
Name
Alane       5  California           15
Ayanna     21    New York           31
Tyisha     12    Virginia           22
Jarvis     10    Michigan           20
Tabetha    30       Texas           40
Geoffrey   13      Nevada           23
Ken        70    Illinois           80


We can also combine several Series of the same length (for example, two columns of the same data frame) in similar element-wise operations.
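
As a minimal sketch of such an operation, assume each person's record has a different age, stored in a hypothetical 'Years_elapsed' column that is not part of the original data. Adding two columns works element by element, matched on the row index:

```python
import pandas as pd

# Stand-in for a few rows of data2; 'Years_elapsed' is a made-up column
df = pd.DataFrame(
    {'Age': [5, 21, 12], 'Years_elapsed': [10, 8, 3]},
    index=pd.Index(['Alane', 'Ayanna', 'Tyisha'], name='Name'),
)

# Element-wise sum of two columns, aligned on the index
df['Age_current'] = df['Age'] + df['Years_elapsed']
print(df)
```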

Now we'll look at some attributes of a pandas DataFrame object.

In [138]:
# columns
data2.columns

Index(['Age', 'Location', 'Age_current'], dtype='object')

In [139]:
# index
data2.index

Index(['Alane', 'Ayanna', 'Tyisha', 'Jarvis', 'Tabetha', 'Geoffrey', 'Ken'], dtype='object', name='Name')

In [140]:
# values
data2.values

array([[5, 'California', 15],
       [21, 'New York', 31],
       [12, 'Virginia', 22],
       [10, 'Michigan', 20],
       [30, 'Texas', 40],
       [13, 'Nevada', 23],
       [70, 'Illinois', 80]], dtype=object)

In [141]:
# indexing the values
data2.values[1]

array([21, 'New York', 31], dtype=object)
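
Note the `dtype=object` in the arrays above: because the columns mix integers and strings, NumPy falls back to a generic object array. The sketch below, on a small stand-in frame (`df` is an assumed name), shows that selecting only the numeric columns first yields a numeric array. It uses `to_numpy()`, the method the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Small stand-in with the same column layout as data2
df = pd.DataFrame({
    'Age': [5, 21],
    'Location': ['California', 'New York'],
    'Age_current': [15, 31],
})

mixed = df.to_numpy()                              # mixed columns -> object array
numeric = df[['Age', 'Age_current']].to_numpy()    # numeric columns only -> integer array

print(mixed.dtype)
print(numeric.dtype)
```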

Now, let's look at some more sophisticated ways of indexing.
<br>We'll use the following:
1. <b>iloc</b>: selects by implicit integer position, like a NumPy array (the end of a slice is excluded)
2. <b>loc</b>: selects by explicit index and column labels (the end of a slice is included)

In [145]:
# first two columns and all rows except the first
data2.iloc[1:, :2]

          Age    Location
Name
Ayanna     21    New York
Tyisha     12    Virginia
Jarvis     10    Michigan
Tabetha    30       Texas
Geoffrey   13      Nevada
Ken        70    Illinois


In [146]:
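# columns up to and including 'Location', and all rows from 'Tyisha' onward
# (this skips the first two rows, since label slices start at the given label)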
data2.loc['Tyisha':, :'Location']

          Age    Location
Name
Tyisha     12    Virginia
Jarvis     10    Michigan
Tabetha    30       Texas
Geoffrey   13      Nevada
Ken        70    Illinois


Let's use what we've learned in this section to apply a mask to our data and output only selected columns like we would do using an SQL query.

In [148]:
data2.loc[data2.Age_current > 25, ['Age_current', 'Location']]

         Age_current  Location
Name
Ayanna            31  New York
Tabetha           40     Texas
Ken               80  Illinois
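
Masks can also be combined, much like conditions in a SQL `WHERE` clause. The sketch below uses a stand-in frame (`df`) with a few of the rows from above; the upper bound of 60 is an arbitrary value chosen for illustration. Each condition needs its own parentheses, and `&` / `|` replace `and` / `or`. `DataFrame.query` offers a string-based alternative for the same filter.

```python
import pandas as pd

# Stand-in with some of the rows of data2
df = pd.DataFrame(
    {
        'Age_current': [15, 31, 22, 40, 80],
        'Location': ['California', 'New York', 'Virginia', 'Texas', 'Illinois'],
    },
    index=pd.Index(['Alane', 'Ayanna', 'Tyisha', 'Tabetha', 'Ken'], name='Name'),
)

# Two conditions joined with & (each wrapped in parentheses)
mask = (df['Age_current'] > 25) & (df['Age_current'] < 60)
selected = df.loc[mask, ['Age_current', 'Location']]

# The same filter written as a query string
queried = df.query('Age_current > 25 and Age_current < 60')[['Age_current', 'Location']]

print(selected)
print(selected.equals(queried))   # True
```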
