# Introduction to Pandas Objects


## Outline
* Pandas Objects
* Series
* DataFrame
* DataFrame & Series: Common Functionality

## Pandas Objects

The two main object types for storing and manipulating data in pandas are:

* Series
* DataFrame

There are other, secondary types that you will come across when using more advanced features of the library, but you will always come back to `Series` and `DataFrame`!

> "The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion". _**Getting Started https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html#data-structures**_
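The container idea in the quote can be sketched in a few lines. The small `frame` below is toy data invented for this illustration; it only shows that columns come out of a DataFrame as Series and can be added or removed with dictionary-style syntax.

```python
import pandas as pd

# Toy data, invented for this illustration.
frame = pd.DataFrame({'name': ['Paul', 'Kristen'], 'age': [32, 22]})

# A DataFrame behaves like a dict of Series: each column is a Series.
age_column = frame['age']
print(type(age_column).__name__)

# Insert a new column with dict-style assignment...
frame['is_adult'] = frame['age'] >= 18

# ...and remove one with dict-style deletion.
del frame['name']
print(list(frame.columns))

# A Series, in turn, holds scalars that can be looked up by label.
first_age = age_column.loc[0]
print(first_age)
```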


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

## Series

A Pandas Series is a one-dimensional array of indexed data.

### Creating a Series

Create a Series from a `list` or a NumPy `array`:

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])

nparray = np.array([0.25, 0.5, 0.75, 1.0])
data = pd.Series(nparray)

print(type(data))
data

<class 'pandas.core.series.Series'>


0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
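"Indexed data" means a Series carries two things: the values and an index that labels them. A minimal sketch of inspecting both, assuming the same four values as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.to_numpy()   # the values as a NumPy array
index = data.index         # a default integer index when none is supplied

print(values)
print(index)
```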

To label the values, pass an `index` when creating the `Series`.

In [3]:
my_index = list('abcdefg') # list() will turn string into ['a', 'b', ...]

data = pd.Series([10,20,30,40,50,60,70], index=my_index)

data

a    10
b    20
c    30
d    40
e    50
f    60
g    70
dtype: int64
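Once a Series has labels, those labels can be used for lookups. The sketch below reuses the values from the cell above; note that a slice written with labels includes both endpoints, unlike a positional slice on a Python list.

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40, 50, 60, 70], index=list('abcdefg'))

single = data['c']        # look up one value by its label
window = data['b':'d']    # label slices include BOTH endpoints

print(single)
print(window)
```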

As a convenience, use a `dict` to specify the values and their index labels together in a more readable structure:

In [4]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

pd.Series(population_dict)

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64
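When building a Series from a `dict`, an explicit `index` can also be passed to choose which keys to keep and in what order. In this sketch, `'Ohio'` is a hypothetical label that is deliberately absent from the dict, to show that missing keys become `NaN`.

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127}

# Keep only the requested labels, in the requested order.
# 'Ohio' is not a key of the dict, so its value is NaN.
subset = pd.Series(population_dict, index=['Texas', 'California', 'Ohio'])

print(subset)
```

Because `NaN` is a floating-point value, the resulting Series has a float dtype rather than an integer one.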

## DataFrame

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
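A small sketch of what "aligned" means in practice, using two made-up Series (the measurements are invented purely for illustration). Building a DataFrame from them lines the values up by index label and fills the gaps with `NaN`.

```python
import pandas as pd

# Made-up measurements whose indexes only partly overlap.
height_cm = pd.Series({'a': 170, 'b': 182, 'c': 165})
weight_kg = pd.Series({'b': 80, 'c': 59, 'd': 72})

# Each Series becomes a column; rows are matched on the index labels.
people = pd.DataFrame({'height_cm': height_cm, 'weight_kg': weight_kg})

print(people)
```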


### Creating a DataFrame

Create a DataFrame from a list of lists, where each inner list is one row of scalar values:

In [5]:
rows = [
    ['Paul', 32, 'Software Engineer'],
    ['Kristen', 22, 'Data Scientist'],
    ['Stanley', 9, 'Dog']
]

pd.DataFrame(rows)

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Paul | 32 | Software Engineer |
| 1 | Kristen | 22 | Data Scientist |
| 2 | Stanley | 9 | Dog |


Add column and row labels with the `index` and `columns` arguments.

In [6]:
pd.DataFrame(rows, 
             columns=['Name', 'Age', 'Profession'],
             index=['P1', 'P2', 'A1'])

|   | Name | Age | Profession |
|---|---|---|---|
| P1 | Paul | 32 | Software Engineer |
| P2 | Kristen | 22 | Data Scientist |
| A1 | Stanley | 9 | Dog |


Again, a `dict` is a convenient way to specify each column label together with the data in that column.

In [10]:
family_dict = {
    'Name': ['Paul', 'Kristen', 'Stanely'],
    'Age': [32, 22, 9],
    'Profession': ['Software Engineer', 'Data Scientist', 'Dog']
}

pd.DataFrame(family_dict)

|   | Age | Name | Profession |
|---|---|---|---|
| 0 | 32 | Paul | Software Engineer |
| 1 | 22 | Kristen | Data Scientist |
| 2 | 9 | Stanely | Dog |


**Note**: You can change column and index labels after creation by setting them as attributes on the DataFrame object.

In [11]:
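# Pass `columns` so the column order is explicit before relabeling by position
df = pd.DataFrame(family_dict, columns=['Age', 'Name', 'Profession'])
df.columns = ["days since getting a full night's sleep", 'Name', 'dream job']
df.index = ['a', 'b', 'c']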

df

|   | days since getting a full night's sleep | Name | dream job |
|---|---|---|---|
| a | 32 | Paul | Software Engineer |
| b | 22 | Kristen | Data Scientist |
| c | 9 | Stanely | Dog |
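Assigning to `columns` or `index` replaces every label at once and depends on their order. As an alternative sketch, `rename` changes only the labels you name and returns a new DataFrame; the frame and the new label names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Paul', 'Kristen'], 'Age': [32, 22]})

# Map old labels to new ones; unlisted labels are left untouched.
renamed = df.rename(columns={'Age': 'age_years'}, index={0: 'first'})

print(renamed)
print(list(df.columns))   # the original frame is unchanged
```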


## DataFrame & Series: Common Functionality

The Pandas library is written with consistency and symmetry in mind, so you will see a lot of functionality that exists on both Series and DataFrames. Much of this functionality will be explained later in the course with a focus on DataFrames (which are more complex, and hence more interesting!), but keep in mind the symmetry between Series and DataFrames.

### Head and Tail
View a small portion of the data with `head()` or `tail()`, which return the first or last five rows by default:

In [13]:
ser = pd.Series(list('abcdefghijklmnop'))
df = pd.read_csv(Path('C:/Users/kkehrer/Desktop/MachineLearningStorytelling/employee_attrition.csv'))


In [14]:
ser.head()

0    a
1    b
2    c
3    d
4    e
dtype: object

In [15]:
df.tail(3) # pass number of rows to specify something other than 5

|   | Age | Attrition | BusinessTravel | DailyRate | DistanceFromHome | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1467 | 27 | No | Travel_Rarely | 155 | 4 | 2 | Male | 87 | 4 | 2 | ... | 4 | 2 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 1468 | 49 | No | Travel_Frequently | 1023 | 2 | 4 | Male | 63 | 2 | 2 | ... | 3 | 4 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 1469 | 34 | No | Travel_Rarely | 628 | 8 | 2 | Male | 82 | 4 | 2 | ... | 3 | 1 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
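Since the CSV above lives on a local path, here is a self-contained sketch of the same two methods on generated data. The `numbers` Series and `table` DataFrame are made up for illustration.

```python
import pandas as pd

numbers = pd.Series(range(100))
table = pd.DataFrame({'n': range(100),
                      'square': [n * n for n in range(100)]})

first_five = numbers.head()     # default: first 5 values
last_three = numbers.tail(3)    # last 3 values
top_rows = table.head(2)        # works the same way on a DataFrame

print(first_five)
print(last_three)
print(top_rows)
```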


### Simple selection

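Using `[]`, you can select values from a `Series` much as you would from a Python `list`. With the default integer index used here, labels and positions coincide, so `ser[2]` returns the third value and `ser[2:6]` returns the values at positions 2 through 5.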

In [16]:
print(ser[2])
print()
print(ser[2:6])

c

2    c
3    d
4    e
5    f
dtype: object
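Plain `[]` is only unambiguous while labels and positions coincide. The sketch below uses a hypothetical Series whose integer labels differ from its positions, and the explicit `.loc` (label-based) and `.iloc` (position-based) accessors to say which one is meant.

```python
import pandas as pd

# Hypothetical Series whose integer labels do NOT match positions.
ser = pd.Series(list('abcdef'), index=[10, 20, 30, 40, 50, 60])

by_label = ser.loc[30]       # the value labeled 30
by_position = ser.iloc[2]    # the third value, whatever its label
pos_slice = ser.iloc[2:4]    # positions 2 and 3 (end excluded)

print(by_label, by_position)
print(pos_slice)
```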


You can apply `[]` to a DataFrame as well, but notice the return types: a column label returns a `Series`, while a slice returns a `DataFrame` containing the selected rows.

In [17]:
age = df['Age']

print(type(age))

age

<class 'pandas.core.series.Series'>


0       41
1       49
2       37
3       33
4       27
5       32
6       59
7       30
8       38
9       36
10      35
11      29
12      31
13      34
14      28
15      29
16      32
17      22
18      53
19      38
20      24
21      36
22      34
23      21
24      34
25      53
26      32
27      42
28      44
29      46
        ..
1440    36
1441    56
1442    29
1443    42
1444    56
1445    41
1446    34
1447    36
1448    41
1449    32
1450    35
1451    38
1452    50
1453    36
1454    45
1455    40
1456    35
1457    40
1458    35
1459    29
1460    29
1461    50
1462    39
1463    31
1464    26
1465    36
1466    39
1467    27
1468    49
1469    34
Name: Age, Length: 1470, dtype: int64

In [18]:
rows = df[2:6]

print(type(rows))

rows

<class 'pandas.core.frame.DataFrame'>


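|   | Age | Attrition | BusinessTravel | DailyRate | DistanceFromHome | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 37 | Yes | Travel_Rarely | 1373 | 2 | 4 | Male | 92 | 2 | 1 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | 3 | 4 | Female | 56 | 3 | 1 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | 2 | 1 | Male | 40 | 3 | 1 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| 5 | 32 | No | Travel_Frequently | 1005 | 2 | 4 | Male | 79 | 3 | 1 | ... | 3 | 3 | 0 | 8 | 2 | 2 | 7 | 7 | 3 | 6 |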
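The return types of `[]` on a DataFrame can be checked side by side on a small frame. The data below is made up for illustration; the double-bracket form `small[['Age']]`, which passes a list of labels, is included because it returns a one-column DataFrame rather than a Series.

```python
import pandas as pd

# Made-up rows, just large enough to slice.
small = pd.DataFrame({'Age': [41, 49, 37, 33],
                      'Gender': ['Female', 'Male', 'Male', 'Female']})

column = small['Age']         # one label          -> Series
sub_frame = small[['Age']]    # list of labels     -> DataFrame
row_slice = small[1:3]        # slice of positions -> DataFrame of rows

for result in (column, sub_frame, row_slice):
    print(type(result).__name__, result.shape)
```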


### shape

Quickly check the shape of your data structure. Notice that `Series.shape` contains only one value (a Series is one-dimensional), whereas `DataFrame.shape` contains both dimensions as `(rows, columns)`.

In [19]:
ser.shape

(16,)

In [20]:
df.shape

(1470, 26)
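Because `shape` is an ordinary tuple, it can be unpacked directly. A brief sketch on made-up data, which also shows the related `ndim` attribute and that `len()` gives the number of rows:

```python
import pandas as pd

letters = pd.Series(list('abcdef'))
grid = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

print(letters.shape, letters.ndim)
print(grid.shape, grid.ndim)

# Unpack the DataFrame shape into rows and columns.
n_rows, n_cols = grid.shape
print(n_rows, n_cols, len(grid))
```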

### describe()

Learn some quick statistics about your data. (Notice that `Series.describe()` is much less verbose than `DataFrame.describe()`.)

In [21]:
ser.describe()

count     16
unique    16
top        c
freq       1
dtype: object

In [22]:
df.describe()

|   | Age | DailyRate | DistanceFromHome | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | 2.728571 | 6502.931293 | 2.693197 | ... | 3.153741 | 2.712245 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | 1.102846 | 4707.956783 | 2.498009 | ... | 0.360824 | 1.081209 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | 1.0 | 1009.0 | 0.0 | ... | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 48.0 | 2.0 | 1.0 | 2.0 | 2911.0 | 1.0 | ... | 3.0 | 2.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 66.0 | 3.0 | 2.0 | 3.0 | 4919.0 | 2.0 | ... | 3.0 | 3.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 83.75 | 3.0 | 3.0 | 4.0 | 8379.0 | 4.0 | ... | 3.0 | 4.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 4.0 | 100.0 | 4.0 | 5.0 | 4.0 | 19999.0 | 9.0 | ... | 4.0 | 4.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |
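The Series summary above is short because `ser` holds strings; a numeric Series gets the same eight statistics as each DataFrame column. In the sketch below, the ages are the first five `Age` values shown earlier, while the `travel` strings are made up for illustration. It also shows that `DataFrame.describe()` skips non-numeric columns unless `include='all'` is passed.

```python
import pandas as pd

ages = pd.Series([41, 49, 37, 33, 27])
summary = ages.describe()
print(summary)

mixed = pd.DataFrame({'age': [41, 49, 37, 33, 27],
                      'travel': ['rarely', 'often', 'rarely', 'often', 'rarely']})

numeric_only = mixed.describe()              # only the numeric column
everything = mixed.describe(include='all')   # numeric and string columns

print(list(numeric_only.columns))
print(list(everything.columns))
```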
