<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width="20%" style="padding: 10px;
              color:white;">
Pandas: Data Structures and Accessing the Data
              
</p>
</div>

Data Science Cohort Live NYC Nov 2023
<p>Phase 1</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
# Import libraries
import numpy as np
import pandas as pd # import pandas library

Pandas has two core data structures:
- Series: a labeled 1D array that natively supports many data operations that numpy arrays do not.
- DataFrame: tabular data with a rich set of table-manipulation operations. Individual columns and rows are pandas Series.

#### Pandas Series

We have data on the highest number of cars that a few famous people have owned. 

| Person | Max number of Cars |
| --- | --- | 
| Muammar Qaddafi | 25000 |
| Mohandas Gandhi | 0 |
| Saddam Hussein | 4500 |
| Kevin Bacon | 2 |
| Billy Bob Thornton | 8 |

Let's represent this as a Series.

In [None]:
pd.Series([25000,0,4500,2,8],
          index = ['Muammar Qaddafi', 'Mohandas Gandhi', 'Saddam Hussein', 'Kevin Bacon', 'Billy Bob Thornton'], 
          name = 'Max Number Cars Owned')

In [None]:
# A dict is a more natural input: keys become the index, values become the data.
car_dict = {'Muammar Qaddafi': 25000, 'Mohandas Gandhi': 0, 
            'Saddam Hussein': 4500, 'Kevin Bacon': 2, 'Billy Bob Thornton': 8}

car_owner_series = pd.Series(car_dict) # create a series from a dict
car_owner_series

Why use a pandas Series?

It combines:
- Dictionary-style fast lookup by key.
- Numpy-style vectorized operations on the values.


In [None]:
# look up a value by its index label
car_owner_series['Billy Bob Thornton']

In [None]:
# can slice on these keys
car_owner_series["Mohandas Gandhi":"Kevin Bacon"]

In [None]:
# can do fast computation like a numpy array

# A new set of values: Kevin Bacon bought one more car and Billy Bob bought two more.
delta_cars = {'Mohandas Gandhi': 0, 'Billy Bob Thornton': 2, 
              'Saddam Hussein': 0, 'Kevin Bacon': 1, 'Muammar Qaddafi': 0}

delta_cars_series = pd.Series(delta_cars)

In [None]:
print(delta_cars_series)

In [None]:
print(car_owner_series)

We want to apply the update, but the two Series are not in the same order.

No problem for pandas: arithmetic between Series aligns on the index labels.

In [None]:
new_car_series = car_owner_series + delta_cars_series
print(new_car_series)
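
Alignment also matters when the two Series do not share every label. The sketch below uses made-up values (not the table above) and assumes one person is missing from the update. Plain `+` gives `NaN` for that label, while `Series.add` with `fill_value=0` treats the missing entry as zero.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate label alignment.
owned = pd.Series({'Kevin Bacon': 2, 'Billy Bob Thornton': 8, 'Mohandas Gandhi': 0})
delta = pd.Series({'Billy Bob Thornton': 2, 'Kevin Bacon': 1})  # no entry for Gandhi

# Plain addition aligns on labels; a label missing from one side gives NaN.
aligned_sum = owned + delta

# .add() with fill_value=0 treats a label missing from one side as 0.
filled_sum = owned.add(delta, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

Which behavior is appropriate depends on whether a missing label means "no change" or "unknown".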

#### Some important Series attributes

- The Series.index attribute: the index labels (keys), returned as a pandas Index object

In [None]:
new_car_series.index

- The Series.values attribute: the Series values, returned as a numpy array

In [None]:
new_car_series.values

- The Series.name attribute: the name of the series

In [None]:
new_car_series.name = 'Max cars owned'
print(new_car_series)

In [None]:
new_car_series.name

- The Series.dtype attribute: the data type of the Series values

In [None]:
new_car_series.dtype

Series have various methods attached.

Example: sorting by max cars in descending order:

In [None]:
new_car_series.sort_values(ascending = False)

Series also have:
- native methods for handling time series data
- a whole host of other useful methods.

We will see these later.

#### Pandas DataFrames

We saw these before with the heart disease dataset. A DataFrame is a tabular data structure.

- Can create these from a 2D numpy array or a dict of lists: 
    - pd.DataFrame(data, index, columns)
- Very often created from a csv file: pd.read_csv(...)
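
As a minimal sketch of the two constructor routes listed above, the cell below builds the same small table twice. The cereal names and numbers are hypothetical and are not taken from the cereal dataset.

```python
import numpy as np
import pandas as pd

row_labels = ['Cereal A', 'Cereal B', 'Cereal C']  # hypothetical row names

# From a dict of lists: the dict keys become the column names.
df_from_dict = pd.DataFrame(
    {'calories': [70, 120, 110], 'fat': [1, 5, 0]},
    index=row_labels,
)

# From a 2D numpy array: pass index and columns explicitly.
data = np.array([[70, 1], [120, 5], [110, 0]])
df_from_array = pd.DataFrame(data, index=row_labels, columns=['calories', 'fat'])

print(df_from_dict)
print(df_from_array)
```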

Let's take a new dataset with data about various breakfast cereals.

In [None]:
# parses the header automatically and interprets the data with the default ',' delimiter
# index_col = 'name' uses the 'name' column as the DataFrame's row index
cereal_df = pd.read_csv('Data/cereal.csv', index_col = 'name')

Often we want a quick view of the first few rows of the table.

The .head() method:

In [None]:
cereal_df.head(2) # first 2 rows; with no argument, head() returns the first 5

Less commonly, we look at the end:

The .tail() method:

In [None]:
cereal_df.tail()

Good common practice: 

Start by looking at some metadata and descriptive statistics for the DataFrame.

- .info() method: data type of each column and non-null counts. Any nulls?
- .describe() method: summary statistics for each numeric column

In [None]:
cereal_df.info()

In [None]:
cereal_df.describe()
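
To make the "any nulls?" check concrete, here is a small sketch on a hypothetical table with one missing value. `isna().sum()` is not used elsewhere in this notebook; it is shown as one common way to count missing values per column alongside `.info()`.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing sugar value.
toy_df = pd.DataFrame(
    {'calories': [70, 120, 110], 'sugars': [6.0, np.nan, 14.0]},
    index=['Cereal A', 'Cereal B', 'Cereal C'],
)

toy_df.info()                      # dtypes and non-null counts per column
null_counts = toy_df.isna().sum()  # number of missing values per column
summary = toy_df.describe()        # count, mean, std, min, quartiles, max

print(null_counts)
print(summary)
```

Note that the `count` row of `.describe()` excludes missing values, so it can differ between columns.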

Important basic DataFrame attributes:

- DataFrame.index: Index object holding the row labels
- DataFrame.columns: Index object holding the column names
- DataFrame.shape: (number of rows, number of columns) tuple


In [None]:
cereal_df.columns

In [None]:
cereal_df.index[0:10]

In [None]:
cereal_df.shape

#### Accessing data in a DataFrame

Accessing data in a Series by named index is easy. Remember:

In [None]:
new_car_series['Billy Bob Thornton']

DataFrames: can access entire **columns** in a similar way. Access the calories column.

In [None]:
cereal_df['calories']

In [None]:
cereal_df.calories # equivalent to cereal_df['calories']

Wait a minute...this is returning a Series with name "calories"! 

Individual columns (and rows) extracted from a DataFrame come back as pandas Series.
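
Attribute-style access is convenient, but it is not always available. The sketch below uses a hypothetical frame in which one column name contains a space and another (`count`) has the same name as a DataFrame method. In both cases bracket access is the form that reaches the column.

```python
import pandas as pd

# Hypothetical frame: 'serving size' contains a space and
# 'count' collides with the DataFrame.count method.
toy_df = pd.DataFrame(
    {'calories': [70, 120], 'serving size': [1.0, 0.75], 'count': [3, 4]},
    index=['Cereal A', 'Cereal B'],
)

calories = toy_df.calories        # works: valid identifier, no name clash
serving = toy_df['serving size']  # a space rules out attribute access
count_col = toy_df['count']       # toy_df.count is the method, not the column

print(type(calories).__name__)
print(callable(toy_df.count))
```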

Can also extract data from a subset of the columns by passing in a list of column names.

DataFrame[list of column names in subset]: returns a DataFrame

In [None]:
col_list = ['calories', 'fat', 'sugars']
cereal_df[col_list]

This is a new DataFrame containing just the listed columns. We can access a single value by chaining a column lookup and a row lookup:

DataFrame[column_name][row_name]

In [None]:
cereal_df['sugars']['Fruity Pebbles']

#### The .loc[] accessor (preferred method for row + column selection):

- Access a single row by named index
- Complex selections: slicing across both rows and columns, etc.
- Essential when assigning values to a selection.

1. DataFrame.loc[row_accessor]
2. DataFrame.loc[row_accessor, column_accessor]
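
The point about assignment can be sketched on a small hypothetical frame. A single `.loc[row, column]` call writes into the DataFrame itself, whereas the chained form may operate on a temporary object.

```python
import pandas as pd

# Hypothetical frame used only to illustrate assignment.
toy_df = pd.DataFrame(
    {'calories': [70, 120], 'sugars': [6, 8]},
    index=['Cereal A', 'Cereal B'],
)

# One .loc call with (row, column) writes into toy_df itself.
toy_df.loc['Cereal B', 'sugars'] = 10

# Chained form to avoid: toy_df['sugars']['Cereal B'] = 10
# It may act on a temporary object, so pandas may warn
# or leave toy_df unchanged.

print(toy_df)
```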


Accessing a single row with .loc[]

In [None]:
cereal_df.head(8)

In [None]:
cereal_df.loc['All-Bran']

Accessing multiple rows:

In [None]:
cereal_df.head(8)

In [None]:
# select rows by list of index names
row_list = ['All-Bran', 'Almond Delight', 'Apple Jacks']
cereal_df.loc[row_list]

In [None]:
#slice rows by name
cereal_df.loc['All-Bran':'Apple Jacks']

Note: with .loc[], the final entry *is included* in the slice.

Accessing multiple columns:

In [None]:
cereal_df.head(8)

In [None]:
# select columns by list
listcol = ["calories", "protein", "fat", "sodium"]
cereal_df.loc["All-Bran", listcol]

In [None]:
# slice on columns by name
cereal_df.loc["All-Bran", 
              "calories":"sodium"]


Putting it all together (selections on rows and columns):

In [None]:
cereal_df.head(8)

In [None]:
# slicing on rows AND columns
cereal_df.loc["All-Bran":"Almond Delight", 
              "calories":"sodium"]

In [None]:
# accessing all rows and a column subset 
# with .loc accessor 
cereal_df.loc[:, ['protein', 'fat']]

In [None]:
cereal_df[['calories','protein']]

The two forms above are equivalent. The only difference arises when slicing on columns:
- The .loc[] accessor is required for this.

In [None]:
cereal_df.loc[:, 'calories':'sodium']

In [None]:
# without .loc, a slice is applied to the row labels, so this does NOT select columns
cereal_df['calories':'sodium']
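
The difference between the two cells above is easier to see on a small hypothetical frame with a sorted row index. A slice inside bare brackets selects rows by label, while `.loc[:, start:stop]` selects columns.

```python
import pandas as pd

# Hypothetical frame with an alphabetically sorted row index.
toy_df = pd.DataFrame(
    {'calories': [70, 120, 110], 'fat': [1, 5, 0], 'sodium': [130, 15, 200]},
    index=['Cereal A', 'Cereal B', 'Cereal C'],
)

# A slice inside bare brackets is applied to the ROW labels.
row_slice = toy_df['Cereal A':'Cereal B']

# Slicing columns by name needs .loc with an explicit row selector.
col_slice = toy_df.loc[:, 'calories':'fat']

print(row_slice)
print(col_slice)
```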

#### The .iloc[] accessor:

- Access rows and columns by their integer position instead of named index.
- Everything else works much the same as with .loc[].

In [None]:
cereal_df.head(5)

In [None]:
cereal_df.iloc[1:4, 2:6]

Note: with an .iloc[] slice, the last index is *NOT included* in the slice.
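
A side-by-side sketch of the two endpoint rules, on a hypothetical four-row frame: the label slice includes its end label, and the position slice stops before its end position, so the two selections below pick the same rows.

```python
import pandas as pd

# Hypothetical frame used only to compare slice endpoints.
toy_df = pd.DataFrame(
    {'calories': [70, 120, 110, 90], 'fat': [1, 5, 0, 2]},
    index=['Cereal A', 'Cereal B', 'Cereal C', 'Cereal D'],
)

by_label = toy_df.loc['Cereal B':'Cereal C']  # end label 'Cereal C' is included
by_position = toy_df.iloc[1:3]                # stop position 3 is excluded

print(by_label)
print(by_position)
```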