# Pandas Data Structures Basics

This notebook reviews:
* creating data manually and loading data from a file
* the Series object and its operations
* the DataFrame object and its operations
* conditional subsetting, fancy slicing, and indexing 
* saving data

---

## Create Your Own Data 

Most of the time, we'll be using data from some other source, but it's still useful to know how to build a DataFrame from scratch (e.g. when creating a small test sample).

### Create a Series

A DataFrame can be thought of as a dictionary of Series objects, where each key is a column name and each value is the Series holding that column's data. A Series is a one-dimensional container similar to a Python list, except that all of its values share a single data type (dtype).

Creating a Series object is as simple as passing a Python list to the *pd.Series()* constructor. If the elements don't all share one type, pandas casts them to a common dtype that can hold every value (often *object*).

In [1]:
import pandas as pd

In [3]:
# create Series with a Python list, observe the type conversion
s = pd.Series(['banana', 42])
s

0    banana
1        42
dtype: object
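The type conversion mentioned above can be made more visible with a small sketch. The variable names below are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# all integers: pandas keeps an integer dtype
ints = pd.Series([1, 2, 3])

# integers mixed with a float: every value is upcast to float64
mixed_numeric = pd.Series([1, 2.5, 3])

# a string mixed with a number: pandas falls back to the generic object dtype
mixed_object = pd.Series(['banana', 42])

print(ints.dtype)           # int64
print(mixed_numeric.dtype)  # float64
print(mixed_object.dtype)   # object
```

With the *object* dtype, each element keeps its original Python type, so the Series above still holds a *str* and an *int*.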

Notice that the Series has an index, shown on the left. By default it is a sequence of integers starting at 0; however, we can assign our own index labels.

In [4]:
# assign Series index names via a Python list
s = pd.Series(
    data=['Wes McKinney', 'Creator of Pandas'],
    index=['Person', 'Who']
)
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object
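Once a Series has index labels, values can be looked up by label as well as by position. The following is a minimal sketch that rebuilds the same Series so it runs on its own.

```python
import pandas as pd

s = pd.Series(
    data=['Wes McKinney', 'Creator of Pandas'],
    index=['Person', 'Who'],
)

# .loc looks a value up by its index label
print(s.loc['Person'])  # Wes McKinney

# .iloc looks a value up by integer position, ignoring the labels
print(s.iloc[1])        # Creator of Pandas
```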

### Create a DataFrame 

Passing a Python dictionary is the easiest way to create a DataFrame. Each key becomes a column name and each value (a list here) becomes that column's contents.

In [5]:
# manually create a DataFrame
scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Occupation": ["Chemist", "Statistician"], 
        "Born": ["1920-07-25", "1876-06-13"],
        "Died": ["1958-04-16", "1937-10-16"],
        "Age": [37, 61],
    }
)

scientists

                Name    Occupation        Born        Died  Age
0  Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
1     William Gosset  Statistician  1876-06-13  1937-10-16   61


In [6]:
type(scientists)

pandas.core.frame.DataFrame

When a dictionary is passed to the DataFrame constructor, the rows are labeled with integers by default. To choose the row labels ourselves, we can use the *index* parameter; the *columns* parameter specifies the column order.

In the cell below, we use the names of the scientists as the row labels instead.

In [7]:
scientists = pd.DataFrame(
    {
        "Occupation": ["Chemist", "Statistician"], 
        "Born": ["1920-07-25", "1876-06-13"],
        "Died": ["1958-04-16", "1937-10-16"],
        "Age": [37, 61],
    }, 
    index=["Rosaline Franklin", "William Gosset"],
    columns=["Occupation", "Born", "Died", "Age"],
)

scientists

                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
William Gosset     Statistician  1876-06-13  1937-10-16   61
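The *columns* parameter can also select a subset of the dictionary keys, not only reorder them. Here is a small sketch using a trimmed-down version of the same data; the names *data* and *subset* are illustrative.

```python
import pandas as pd

data = {
    "Occupation": ["Chemist", "Statistician"],
    "Born": ["1920-07-25", "1876-06-13"],
    "Age": [37, 61],
}

# keep only two of the three keys, in the order listed in `columns`
subset = pd.DataFrame(
    data,
    index=["Rosaline Franklin", "William Gosset"],
    columns=["Age", "Occupation"],
)

print(list(subset.columns))  # ['Age', 'Occupation']

# each column of the DataFrame is itself a Series
print(type(subset["Age"]))
```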


--- 

## The Series 

Recall that using *.loc[]* to select a single row of our DataFrame returns a Series object.

In [8]:
# our DataFrame with row index labels
scientists

                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
William Gosset     Statistician  1876-06-13  1937-10-16   61


In [10]:
# select a row by row index label 
second_row = scientists.loc["William Gosset"]
type(second_row)

pandas.core.series.Series

In [12]:
# show column values for the second row
second_row

Occupation    Statistician
Born            1876-06-13
Died            1937-10-16
Age                     61
Name: William Gosset, dtype: object

In [13]:
# the .index attribute returns the Series index (here, the DataFrame's column names)
second_row.index

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

In [14]:
# the .values attribute returns the values stored in the Series
second_row.values

array(['Statistician', '1876-06-13', '1937-10-16', 61], dtype=object)

In [15]:
# the .keys() method is equivalent to the .index attribute
second_row.keys()

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')
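A row and a column of the same DataFrame are both Series, but they differ in what their index holds and in their dtype. The sketch below uses a two-column version of the DataFrame so that it is self-contained; *row* and *col* are illustrative names.

```python
import pandas as pd

scientists = pd.DataFrame(
    {
        "Occupation": ["Chemist", "Statistician"],
        "Age": [37, 61],
    },
    index=["Rosaline Franklin", "William Gosset"],
)

row = scientists.loc["William Gosset"]  # Series indexed by column names
col = scientists["Age"]                 # Series indexed by row labels

# the row mixes a string and an integer, so its dtype is object
print(row.dtype)  # object

# the column holds only integers, so it keeps an integer dtype
print(col.dtype)  # int64

# .keys() and .index return the same labels
print(row.keys().equals(row.index))  # True
```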

### The Series is ndarray-like

A Series is very similar to a NumPy ndarray; as such, many of the methods and functions are shared.

In [17]:
# get a Series for Age column in our DataFrame using subsetting
ages = scientists['Age']
ages

Rosaline Franklin    37
William Gosset       61
Name: Age, dtype: int64

In [20]:
# compute the mean using a method shared with ndarray
ages.mean()

49.0
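A few more shared methods are sketched below, using the same two ages. One difference worth knowing: the Series *.std()* method computes the sample standard deviation (ddof=1) by default, while the ndarray *.std()* method computes the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [37, 61],
    index=["Rosaline Franklin", "William Gosset"],
    name="Age",
)

print(ages.min())  # 37
print(ages.max())  # 61

# pandas default: sample standard deviation (ddof=1)
print(ages.std())

# NumPy default on the underlying array: population standard deviation (ddof=0)
print(ages.to_numpy().std())  # 12.0

# NumPy functions accept a Series and return a Series with the same index
roots = np.sqrt(ages)
print(roots)
```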

### Boolean Subsetting: Series

Subsetting by specific indices works for smaller datasets; however, when datasets are larger, we'll often want to subset by looking for values that meet or don't meet some condition.

We'll demonstrate this by examining a dataset.

In [21]:
scientists = pd.read_csv("./scientists.csv")

In [22]:
scientists

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


In [25]:
ages = scientists['Age']
ages

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [26]:
# use the .describe() method to compute multiple descriptive stats at once 
# for a given attribute
ages.describe()

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

**Note:** Like most pandas summary methods, *.describe()* automatically ignores missing values (NaN).

Now that we have some descriptive statistics, we can use them to subset our Series.
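To see the handling of missing values in action, here is a small sketch with a made-up Series that contains one NaN. The data and the name *ages_with_gap* are hypothetical and are not taken from the scientists dataset.

```python
import numpy as np
import pandas as pd

# hypothetical data with one missing value
ages_with_gap = pd.Series([37, 61, np.nan, 90])

summary = ages_with_gap.describe()

# the Series has four entries, but only three are counted
print(len(ages_with_gap))  # 4
print(summary["count"])    # 3.0

# the mean is computed over the three non-missing values
print(ages_with_gap.mean())
```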

In [27]:
# which observations are above the mean?
ages[ages > ages.mean()]

1    61
2    90
3    66
7    77
Name: Age, dtype: int64

The above statement does it all at once, but the following cells show what happens step-by-step using boolean subsetting.

In [30]:
# return a boolean Series to be used as our boolean vector
bv = ages > ages.mean()
type(bv)

pandas.core.series.Series

In [33]:
# subset with the boolean vector (equivalent to cell above)
ages[bv]

1    61
2    90
3    66
7    77
Name: Age, dtype: int64
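Boolean vectors can also be combined or negated before subsetting. The sketch below rebuilds the ages from the table above by hand so it runs without the CSV file; the name *under_80* and the threshold of 80 are illustrative choices.

```python
import pandas as pd

# ages copied from the scientists table above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

above_mean = ages > ages.mean()
under_80 = ages < 80

# & is the element-wise "and" of two boolean Series
print(ages[above_mean & under_80].tolist())  # [61, 66, 77]

# ~ negates a boolean Series
print(ages[~above_mean].tolist())            # [37, 56, 45, 41]
```

When the conditions are written inline, each one needs its own parentheses, e.g. `ages[(ages > ages.mean()) & (ages < 80)]`, because `&` binds more tightly than the comparison operators.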

### Operations are Automatically Aligned and Vectorized (Broadcasting)

Operations on Series and DataFrames in pandas are vectorized: they apply to every element at once, without an explicit Python loop.
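As a rough illustration of what "vectorized" means, the sketch below compares an explicit Python loop with the equivalent single expression. The three values are a shortened, illustrative sample.

```python
import pandas as pd

ages = pd.Series([37, 61, 90])

# explicit loop: handle one element at a time
looped = pd.Series([age * 2 for age in ages])

# vectorized: one expression applied to the whole Series
vectorized = ages * 2

print(vectorized.tolist())                     # [74, 122, 180]
print(vectorized.tolist() == looped.tolist())  # True
```

Both give the same values, but the vectorized form is shorter and is generally faster on large data because the loop runs in compiled code rather than in Python.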

#### Vectors of Same Length
Vector operations on vectors of the same length result in an element-by-element calculation.

In [34]:
# element-by-element addition 
ages + ages

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [35]:
# element-by-element multiplication
ages * ages

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

#### Vectors with Scalars
An operation between a vector and a scalar reuses (broadcasts) the scalar for each element in the vector.

In [36]:
# pandas scalar addition
ages + 100

0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64