# Pandas Data Structures Basics

This notebook reviews:
* creating data manually and loading data with functions
* the Series object and its operations
* the DataFrame object and its operations
* conditional subsetting, fancy slicing, and indexing 
* saving data

---

## Create Your Own Data 

Most of the time, we'll be using data from some other source, but it's still useful to know how to build a DataFrame from scratch (e.g. when creating a small test sample).

### Create a Series

A DataFrame can be thought of as a dictionary of Series objects, where each key is a column name and each value is the Series holding that column. A Series is similar to a Python list, except that all of its values share a single data type.

Creating a Series object is as simple as passing a Python list to the *pd.Series()* constructor. If the list mixes data types, pandas falls back to a dtype that can hold every value (often *object*).

In [1]:
import pandas as pd

In [3]:
# create Series with a Python list, observe the type conversion
s = pd.Series(['banana', 42])
s

0    banana
1        42
dtype: object
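As a small sketch of the type conversion mentioned above, the cell below builds three Series from lists with different contents and prints the dtype pandas infers for each. The variable names are made up for illustration.

```python
import pandas as pd

# all integers: pandas infers an integer dtype
ints = pd.Series([1, 2, 3])

# integers mixed with a float: every value is upcast to float64
mixed_numbers = pd.Series([1, 2.5, 3])

# a string mixed with an integer: pandas falls back to the generic object dtype
mixed_types = pd.Series(['banana', 42])

print(ints.dtype, mixed_numbers.dtype, mixed_types.dtype)
```

The upcast is silent, so checking *.dtype* after building a Series by hand is a cheap sanity check.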

Notice the index on the left of the Series output. By default it is a sequence of integers starting at 0; however, we can assign our own index labels.

In [4]:
# assign Series index names via a Python list
s = pd.Series(
    data=['Wes McKinney', 'Creator of Pandas'],
    index=['Person', 'Who']
)
s

Person         Wes McKinney
Who       Creator of Pandas
dtype: object

### Create a DataFrame 

Python dictionaries are the easiest way of creating a DataFrame. The key is the column name and the values are the column contents. 

In [5]:
# manually create a DataFrame
scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Occupation": ["Chemist", "Statistician"], 
        "Born": ["1920-07-25", "1876-06-13"],
        "Died": ["1958-04-16", "1937-10-16"],
        "Age": [37, 61],
    }
)

scientists

Unnamed: 0,Name,Occupation,Born,Died,Age
0,Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37
1,William Gosset,Statistician,1876-06-13,1937-10-16,61


In [6]:
type(scientists)

pandas.core.frame.DataFrame

Passing a dictionary to the DataFrame constructor gives the rows default integer labels. To choose the row labels ourselves, we can use the *index* parameter; the *columns* parameter specifies the column order.

In the cell below, we use the names of the scientists as the names for the rows instead. 

In [7]:
scientists = pd.DataFrame(
    {
        "Occupation": ["Chemist", "Statistician"], 
        "Born": ["1920-07-25", "1876-06-13"],
        "Died": ["1958-04-16", "1937-10-16"],
        "Age": [37, 61],
    }, 
    index=["Rosaline Franklin", "William Gosset"],
    columns=["Occupation", "Born", "Died", "Age"],
)

scientists

Unnamed: 0,Occupation,Born,Died,Age
Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37
William Gosset,Statistician,1876-06-13,1937-10-16,61
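To make the "dictionary of Series" idea concrete, here is a minimal sketch that builds a DataFrame from two Series sharing the same index and uses *columns* to pick the column order. The variable names (*names*, *occupation*, *age*, *df*) are only illustrative.

```python
import pandas as pd

names = ["Rosaline Franklin", "William Gosset"]

# two Series that share the same row labels
occupation = pd.Series(["Chemist", "Statistician"], index=names)
age = pd.Series([37, 61], index=names)

# each dictionary key becomes a column name; `columns` sets the order
df = pd.DataFrame(
    {"Occupation": occupation, "Age": age},
    columns=["Age", "Occupation"],
)

print(df)
print(type(df["Age"]))  # selecting one column gives back a Series
```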


--- 

## The Series 

Recall that using *.loc[]* to subset a single row of our DataFrame returns a Series object.

In [8]:
# our DataFrame with a rows index label
scientists

Unnamed: 0,Occupation,Born,Died,Age
Rosaline Franklin,Chemist,1920-07-25,1958-04-16,37
William Gosset,Statistician,1876-06-13,1937-10-16,61


In [10]:
# select a row by row index label 
second_row = scientists.loc["William Gosset"]
type(second_row)

pandas.core.series.Series

In [12]:
# show column values for the second row
second_row

Occupation    Statistician
Born            1876-06-13
Died            1937-10-16
Age                     61
Name: William Gosset, dtype: object

In [13]:
# .index Series attribute returns the index values for the series (column names)
second_row.index

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

In [14]:
# .values Series attribute returns the values stored in the Series
second_row.values

array(['Statistician', '1876-06-13', '1937-10-16', 61], dtype=object)

In [15]:
# .keys() Series method is equivalent to .index attribute
second_row.keys()

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')

### The Series is ndarray-like

A Series is very similar to a NumPy ndarray, so many methods and functions work on both.

In [17]:
# get a Series for Age column in our DataFrame using subsetting
ages = scientists['Age']
ages

Rosaline Franklin    37
William Gosset       61
Name: Age, dtype: int64

In [20]:
# compute ndarray values using shared methods 
ages.mean()

49.0
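A few more shared operations, as a quick sketch. The two-row *ages* Series is rebuilt here so the cell runs on its own, and *np* is the conventional alias for NumPy.

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [37, 61],
    index=["Rosaline Franklin", "William Gosset"],
    name="Age",
)

# methods that mirror their ndarray counterparts
print(ages.min(), ages.max(), ages.std())

# NumPy functions accept a Series and return a Series with the same index
roots = np.sqrt(ages)
print(roots)
```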

### Boolean Subsetting: Series

Subsetting by specific indices works for smaller datasets; however, when datasets are larger, we'll often want to subset by looking for values that meet or don't meet some condition.

We'll demonstrate this by examining a dataset.

In [21]:
scientists = pd.read_csv("./scientists.csv")

In [22]:
scientists

Unnamed: 0,Name,Born,Died,Age,Occupation
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist
5,John Snow,1813-03-15,1858-06-16,45,Physician
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician


In [25]:
ages = scientists['Age']
ages

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [26]:
# use the .describe() method to compute multiple descriptive stats at once 
# for a given attribute
ages.describe()

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

**Note:** *.describe()* is one of the Series methods that *will* automatically drop missing values.

Now that we have some descriptive statistics, we can use them to subset our Series.

In [27]:
# which specific observations are above the mean?...
ages[ages > ages.mean()]

1    61
2    90
3    66
7    77
Name: Age, dtype: int64

The above statement does it all at once, but the following cells show what happens step-by-step using boolean subsetting.

In [30]:
# return a boolean Series to be used as our boolean vector
bv = ages > ages.mean()
type(bv)

pandas.core.series.Series

In [33]:
# subset with the boolean vector (equivalent to cell above)
ages[bv]

1    61
2    90
3    66
7    77
Name: Age, dtype: int64
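Boolean vectors can also be combined before subsetting. The sketch below uses *&* (and) and *~* (not) on two boolean Series; the names *above_mean* and *under_80* are just for illustration, and the ages are retyped from the dataset above so the cell is self-contained.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

above_mean = ages > ages.mean()   # mean is 59.125
under_80 = ages < 80

# element-wise AND: above the mean and under 80
both = ages[above_mean & under_80]

# element-wise NOT: everything at or below the mean
not_above = ages[~above_mean]

print(both.tolist())       # [61, 66, 77]
print(not_above.tolist())  # [37, 56, 45, 41]
```

Python's *and*, *or*, and *not* keywords do not work element-wise on a Series, which is why the operators *&*, *|*, and *~* are used instead.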

### Operations are Automatically Aligned and Vectorized (Broadcasting)

Operations on Series and DataFrames using pandas are vectorized operations. 

#### Vectors of Same Length
Vector operations on vectors of the same length result in an element-by-element calculation.

In [34]:
# element-by-element addition 
ages + ages

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64

In [35]:
# element-by-element multiplication
ages * ages

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64

#### Vectors with Scalars
Operations between a vector and a scalar will result in the recycling of the scalar for each element in the vector. 

In [37]:
# pandas scalar addition
ages + 100

0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64

#### Vectors with Different Lengths

Broadcasting refers to how operations are calculated between arrays with different shapes. For two Series of different lengths, pandas does not recycle the shorter one to fit. Instead, it aligns the two by index label: labels present in both Series are combined, and labels present in only one produce a missing value (NaN).

In [38]:
# the shorter Series is not recycled; unmatched index labels become NaN
ages + pd.Series([1, 100])

0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64
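If NaN is not what you want for the unmatched labels, one option is the *.add()* method with its *fill_value* parameter, which substitutes a value for whichever side is missing before adding. A minimal sketch, with *short* as a made-up name for the two-element Series:

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
short = pd.Series([1, 100])

# plain addition: labels 2 through 7 have no partner, so they become NaN
plain = ages + short

# .add() with fill_value=0 treats the missing partner as 0 instead
filled = ages.add(short, fill_value=0)

print(plain.isna().sum())  # 6
print(filled.tolist())
```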

#### Vectors with Common Index Labels (Automatic Alignment)

Pandas automatically aligns data by index label for most operations.

In [39]:
ages

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [41]:
rev_ages = ages.sort_index(ascending=False)
rev_ages

7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

By alignment, we mean that the row index label stays with its value as it appears in the original data. 

So, for example, when we add the two vectors ages and rev_ages, it will align the index labels to match before performing the addition. Thus, index label 0 will be matched with index label 0, etc.

In [42]:
# result of auto alignment
ages + rev_ages


0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
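To see that the alignment is really by label and not by position, it can help to compare against plain NumPy arrays, which only know about position. This is a small sketch that rebuilds *ages* so it runs on its own.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
rev_ages = ages.sort_index(ascending=False)

# Series + Series: matched by index label, so each age is doubled
aligned = ages + rev_ages

# ndarray + ndarray: matched by position, so first is added to last, etc.
positional = ages.to_numpy() + rev_ages.to_numpy()

print(aligned.tolist())     # [74, 122, 180, 132, 112, 90, 82, 154]
print(positional.tolist())  # [114, 102, 135, 122, 122, 135, 102, 114]
```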

--- 

## The DataFrame

The DataFrame is our rectangular, two-dimensional data structure.

The three major parts of a pandas DataFrame are:
* .index - refers to the row labels
* .columns - refers to the column names
* .values - refers to the data values; useful when you just want the NumPy representation of the data without the index label information.



In [43]:
# rows index 
scientists.index

RangeIndex(start=0, stop=8, step=1)

In [44]:
# column names 
scientists.columns

Index(['Name', 'Born', 'Died', 'Age', 'Occupation'], dtype='object')

In [45]:
# data values 
scientists.values

array([['Rosaline Franklin', '1920-07-25', '1958-04-16', 37, 'Chemist'],
       ['William Gosset', '1876-06-13', '1937-10-16', 61, 'Statistician'],
       ['Florence Nightingale', '1820-05-12', '1910-08-13', 90, 'Nurse'],
       ['Marie Curie', '1867-11-07', '1934-07-04', 66, 'Chemist'],
       ['Rachel Carson', '1907-05-27', '1964-04-14', 56, 'Biologist'],
       ['John Snow', '1813-03-15', '1858-06-16', 45, 'Physician'],
       ['Alan Turing', '1912-06-23', '1954-06-07', 41,
        'Computer Scientist'],
       ['Johann Gauss', '1777-04-30', '1855-02-23', 77, 'Mathematician']],
      dtype=object)
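Alongside *.values*, pandas also provides the *.to_numpy()* method, which the pandas documentation recommends for getting the underlying array. A short sketch on a made-up two-row frame, also showing *.shape*:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Age": [37, 61],
    }
)

# (number of rows, number of columns)
print(df.shape)

# whole frame as a 2-D array; mixed column types share one common dtype
arr = df.to_numpy()
print(arr.shape)

# a purely numeric selection keeps a numeric dtype
age_arr = df[["Age"]].to_numpy()
print(age_arr.dtype)
```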

### Boolean Subsetting: DataFrames


In [46]:
scientists['Age'].describe()

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64

In [47]:
# return observations where age is greater than the average age
scientists.loc[scientists['Age'] > scientists['Age'].mean()]

Unnamed: 0,Name,Born,Died,Age,Occupation
1,William Gosset,1876-06-13,1937-10-16,61,Statistician
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician
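The same idea extends to several conditions and to picking columns at the same time. The sketch below uses a four-row excerpt of the data (so its mean differs from the full dataset); each condition is wrapped in parentheses because *&* binds more tightly than the comparison operators.

```python
import pandas as pd

scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset",
                 "Florence Nightingale", "Marie Curie"],
        "Age": [37, 61, 90, 66],
        "Occupation": ["Chemist", "Statistician", "Nurse", "Chemist"],
    }
)

above_mean = scientists["Age"] > scientists["Age"].mean()  # mean is 63.5

# two conditions combined with &
older_chemists = scientists.loc[
    above_mean & (scientists["Occupation"] == "Chemist")
]

# rows by condition, columns by name, in one .loc call
names_and_ages = scientists.loc[above_mean, ["Name", "Age"]]

print(older_chemists)
print(names_and_ages)
```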


---

## Making Changes to Series and DataFrames

Let's review how to alter our data objects.

### Add Additional Columns 


In [48]:
# check out our dataframe datatypes 
scientists.dtypes

Name          object
Born          object
Died          object
Age            int64
Occupation    object
dtype: object

In [50]:
# convert strings to datetime to perform date and time operations on them
born_datetime = pd.to_datetime(scientists['Born'], format='%Y-%m-%d')
born_datetime

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]

In [52]:
died_datetime = pd.to_datetime(scientists['Died'], format='%Y-%m-%d')
died_datetime

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]

Let's add a new set of columns that contain the datetime representations of the *Born* and *Died* columns.

In [53]:
# add new columns to a DataFrame via assignment
scientists['born_dt'], scientists['died_dt'] = (
    born_datetime,
    died_datetime
)

In [55]:
scientists.head()

Unnamed: 0,Name,Born,Died,Age,Occupation,born_dt,died_dt
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist,1920-07-25,1958-04-16
1,William Gosset,1876-06-13,1937-10-16,61,Statistician,1876-06-13,1937-10-16
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse,1820-05-12,1910-08-13
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist,1867-11-07,1934-07-04
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist,1907-05-27,1964-04-14


In [56]:
# confirm the new columns have a datetime dtype
scientists.dtypes

Name                  object
Born                  object
Died                  object
Age                    int64
Occupation            object
born_dt       datetime64[ns]
died_dt       datetime64[ns]
dtype: object
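When the data comes from a CSV file, another option is to let *pd.read_csv()* do the conversion at load time through its *parse_dates* parameter. The sketch below reads from an in-memory string (via *io.StringIO*) rather than the real *scientists.csv*, so the two rows are only a stand-in.

```python
import io

import pandas as pd

# stand-in for a CSV file on disk
csv_text = """Name,Born,Died
Rosaline Franklin,1920-07-25,1958-04-16
William Gosset,1876-06-13,1937-10-16
"""

# parse_dates lists the columns to convert to datetime while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Born", "Died"])

print(df.dtypes)
```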

### Directly Change a Column

The text demonstrates directly changing the values of a column by playing around with the ages data.

I don't actually care to modify columns directly the way it is presented in the text, so let's just move on...
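For reference, here is one common form of direct modification, as a minimal sketch rather than the text's own example: assigning to an existing column name overwrites that column in place. The two-row frame is made up so the cell runs on its own.

```python
import pandas as pd

scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Born": ["1920-07-25", "1876-06-13"],
    }
)

# assigning to an existing column name replaces its contents in place
scientists["Born"] = pd.to_datetime(scientists["Born"], format="%Y-%m-%d")

print(scientists.dtypes)
```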


### Modifying Columns with *.assign()*

(I prefer this way.) We can use the *.assign()* method to add new columns or modify existing ones. It returns a new DataFrame rather than changing the original in place.

To demonstrate its use, let's create two new columns for our DataFrame. 

In [61]:
scientists.head()

Unnamed: 0,Name,Born,Died,Age,Occupation,born_dt,died_dt
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist,1920-07-25,1958-04-16
1,William Gosset,1876-06-13,1937-10-16,61,Statistician,1876-06-13,1937-10-16
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse,1820-05-12,1910-08-13
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist,1867-11-07,1934-07-04
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist,1907-05-27,1964-04-14


In [65]:
scientists = scientists.assign(
    age_days = scientists['died_dt'] - scientists['born_dt']
)



In [70]:
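# a separate .assign() call is used here because age_days did not exist
# on `scientists` until the previous cell ran
# note: this cast leaves the values in days; pandas timedeltas do not
# support a year resolution, since years vary in length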
scientists = scientists.assign(
    age_years = scientists['age_days'].astype('timedelta64[ns]')
)
    

In [69]:
scientists

Unnamed: 0,Name,Born,Died,Age,Occupation,born_dt,died_dt,age_days,age_years
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist,1920-07-25,1958-04-16,13779 days,13779 days
1,William Gosset,1876-06-13,1937-10-16,61,Statistician,1876-06-13,1937-10-16,22404 days,22404 days
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse,1820-05-12,1910-08-13,32964 days,32964 days
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist,1867-11-07,1934-07-04,24345 days,24345 days
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist,1907-05-27,1964-04-14,20777 days,20777 days
5,John Snow,1813-03-15,1858-06-16,45,Physician,1813-03-15,1858-06-16,16529 days,16529 days
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist,1912-06-23,1954-06-07,15324 days,15324 days
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician,1777-04-30,1855-02-23,28422 days,28422 days
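One way to actually get ages in years, sketched under the assumption that 365.25 days per year is accurate enough, is to pull the day count out with *.dt.days* and divide. Passing functions (here, lambdas) to *.assign()* also lets a later column refer to one created earlier in the same call, so a single call is enough.

```python
import pandas as pd

scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "born_dt": pd.to_datetime(["1920-07-25", "1876-06-13"]),
        "died_dt": pd.to_datetime(["1958-04-16", "1937-10-16"]),
    }
)

scientists = scientists.assign(
    # each lambda receives the DataFrame as built so far
    age_days=lambda df: df["died_dt"] - df["born_dt"],
    # 365.25 days per year is an approximation, not an exact calendar rule
    age_years=lambda df: (df["age_days"].dt.days / 365.25).round(1),
)

print(scientists[["Name", "age_days", "age_years"]])
```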


### Dropping Values

There are two common ways to drop columns:
1. select only the columns you want to keep by using subsetting
2. use the *.drop()* method

In [71]:
# current columns in dataframe
scientists.columns

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'born_dt', 'died_dt',
       'age_days', 'age_years'],
      dtype='object')

In [75]:
# let's drop the age_years column 
scientists = scientists.drop(['age_years'], axis="columns")

In [76]:
scientists

Unnamed: 0,Name,Born,Died,Age,Occupation,born_dt,died_dt,age_days
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist,1920-07-25,1958-04-16,13779 days
1,William Gosset,1876-06-13,1937-10-16,61,Statistician,1876-06-13,1937-10-16,22404 days
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse,1820-05-12,1910-08-13,32964 days
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist,1867-11-07,1934-07-04,24345 days
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist,1907-05-27,1964-04-14,20777 days
5,John Snow,1813-03-15,1858-06-16,45,Physician,1813-03-15,1858-06-16,16529 days
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist,1912-06-23,1954-06-07,15324 days
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician,1777-04-30,1855-02-23,28422 days
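The cell above uses the *.drop()* method; the subsetting approach from the list gives the same result. A minimal sketch on a made-up three-column frame:

```python
import pandas as pd

scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Age": [37, 61],
        "age_years": [37.7, 61.3],
    }
)

# option 1: keep only the columns we want
kept = scientists[["Name", "Age"]]

# option 2: name the columns to remove
dropped = scientists.drop(columns=["age_years"])

print(kept.equals(dropped))  # True
```

Neither option changes *scientists* itself; both return a new DataFrame, which is why the earlier cell assigns the result back to the same name.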


---

## Exporting and Importing Data

### Pickle

Pickling is how Python serializes and saves data in a binary format.

### Pickling Series 

In [77]:
# save some data in a binary format
names = scientists['Name']
names

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object

In [78]:
# pickle it 
names.to_pickle('./scientists_names_series.pickle')

This binary format is only really readable by Python, so anyone not using Python is out of luck. However, if you know the data won't leave the Python world, or it's simply an intermediate processing step, then saving it in pickle format is a good option.

The pickle format is optimized for Python, and the binary format is typically more compact on disk than a text format.

In [80]:
# we can also pickle entire dataframes
scientists.to_pickle('./scientists_df.pickle')

Reading pickle data is as simple as calling pandas' *pd.read_pickle()* function. 

In [81]:
# read a pickled Series 
series_pickle = pd.read_pickle("./scientists_names_series.pickle")
series_pickle

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object

In [83]:
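# read a pickled DataFrame
dataframe_pickle = pd.read_pickle("./scientists_df.pickle")
dataframe_pickle

Unnamed: 0,Name,Born,Died,Age,Occupation,born_dt,died_dt,age_days
0,Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist,1920-07-25,1958-04-16,13779 days
1,William Gosset,1876-06-13,1937-10-16,61,Statistician,1876-06-13,1937-10-16,22404 days
2,Florence Nightingale,1820-05-12,1910-08-13,90,Nurse,1820-05-12,1910-08-13,32964 days
3,Marie Curie,1867-11-07,1934-07-04,66,Chemist,1867-11-07,1934-07-04,24345 days
4,Rachel Carson,1907-05-27,1964-04-14,56,Biologist,1907-05-27,1964-04-14,20777 days
5,John Snow,1813-03-15,1858-06-16,45,Physician,1813-03-15,1858-06-16,16529 days
6,Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist,1912-06-23,1954-06-07,15324 days
7,Johann Gauss,1777-04-30,1855-02-23,77,Mathematician,1777-04-30,1855-02-23,28422 days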
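A quick way to convince yourself that pickling keeps everything, dtypes included, is a round trip with an equality check. This sketch writes to a temporary directory (via the standard library's *tempfile*) so it leaves no files behind; the small frame is made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

scientists = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Age": [37, 61],
        "born_dt": pd.to_datetime(["1920-07-25", "1876-06-13"]),
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "scientists_df.pickle"
    scientists.to_pickle(path)        # write the binary file
    restored = pd.read_pickle(path)   # read it back

# values, index, and dtypes all survive the round trip
print(restored.equals(scientists))
print(restored.dtypes)
```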

### CSV

You know it. You love it. We already know how to read from a CSV file, but pandas also has the *.to_csv()* method, which outputs a Series or DataFrame to a CSV file.

CSV is basically universal, but the tradeoff is that it's slower and takes up more disk space compared to binary formats. 

With a pandas DataFrame or Series, we typically want to avoid writing the index as an extra first column when outputting a CSV, because it causes issues when reading the file back into pandas or another program. So, when outputting a file to CSV, just drop the index. 

In [84]:
scientists.to_csv("./scientists_df_no_index.csv", index=False)
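To see the extra-column problem for yourself, the sketch below writes a small made-up frame to CSV text both ways and reads each version back. Calling *.to_csv()* without a path returns the CSV as a string, which keeps the example off the disk.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Rosaline Franklin", "William Gosset"],
        "Age": [37, 61],
    }
)

# with the index: the header row starts with an empty field
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())  # ['Unnamed: 0', 'Name', 'Age']

# without the index: the columns come back unchanged
no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(no_index.columns.tolist())    # ['Name', 'Age']
```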

### Excel

You know it. You hate it. 

To export a Series to an Excel format, you have to convert it to a one column DataFrame and then call the *.to_excel()* method on that DataFrame.

In [86]:
# output a Series object as an Excel file by converting to DF
names_df = names.to_frame()

# output the dataframe (openpyxl writes .xlsx files, not the legacy .xls format)
names_df.to_excel("./scientists.xlsx", engine="openpyxl")

In [87]:
# more options, output to a specific sheet name
scientists.to_excel(
    "./scientists_df.xlsx",
    sheet_name="scientists",
    index=False
)

There are plenty of other useful data formats to export to and import from. Just consult the docs as the need arises. 