
In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Pandas Introduction

This notebook reviews the first four sections of [Chapter 3](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) of the Python Data Science Handbook. Please refer to those sections for more detailed explanations.

We're going to be using a dataset about movies to try out processing some data with Pandas.

We start with some standard imports:

In [None]:
import pandas as pd


## 1. The Pandas Series Object
<p>A Pandas Series is a one-dimensional array of indexed data. Every Series has an implicit integer (positional) index, and it can also carry an explicitly defined index of labels associated with the values.</p>


### 1.1 Creating Series

In [None]:
data_1 = pd.Series([0.25, 0.5, 0.75, 1.0])
data_1

Unnamed: 0,0
0,0.25
1,0.5
2,0.75
3,1.0


In [None]:
type(data_1)

In [None]:
lst_1 = [0.25, 0.5, 0.75, 1.0]
type(lst_1)

In [None]:
data_1.index

In [None]:
data_2 = pd.Series([0.25, 0.5, 0.75, 1.0], name = 'value', index=['a', 'b', 'c', 'd'])
data_2

Unnamed: 0,value
a,0.25
b,0.5
c,0.75
d,1.0


In [None]:
data_2.index

Relabel the Series by assigning new index values:

In [None]:
data_2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
data_2
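
Assigning to `.index` only swaps the labels; it is a different operation from the `reindex` method, which looks values up by their existing labels. The sketch below contrasts the two. The variable names and the label `'z'` are made up for illustration.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Assigning to .index relabels: same values, same order, new labels.
relabeled = s.copy()
relabeled.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

# reindex() selects by existing label; a label that is absent ('z') yields NaN.
reindexed = s.reindex(['d', 'a', 'z'])

print(relabeled)
print(reindexed)
```

The new list assigned to `.index` must have the same length as the Series, whereas `reindex` may return a Series of a different length.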

A Series can also be created from a dictionary, in which case the keys become the index:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict, name = 'population')
population

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


### 1.2 Series indexing and selection

In [None]:
# use this for demonstration
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

Unnamed: 0,0
a,0.25
b,0.5
c,0.75
d,1.0


In [None]:
# Access individual item by implicit index and explicit index
print(data.iloc[1])
print(data['b'])

0.5
0.5


In [None]:
# Slicing by explicit index (end label included) and implicit index (end position excluded)
print(data['b':'d'])
print(data.iloc[1:4])
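
The two slices above return the same rows only because their endpoints were chosen to match. A minimal sketch of the difference, reusing the same demonstration Series:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included.
by_label = data['b':'c']

# Position-based slice: the end position 2 is excluded.
by_position = data.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```

The label slice contains `b` and `c`, while the positional slice contains only `b`.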

#### Masking concept
"It essentially works with a list of Booleans (True/False), which when applied to the original array returns the elements of interest. Here, True refers to the elements that satisfy the condition (smaller than 4 and larger than 6 in our case), and False refers to the elements that do not satisfy the condition."

<small>https://towardsdatascience.com/the-concept-of-masks-in-python-50fd65e64707</small>

In [None]:
# one mask
(data > 0.3)

Unnamed: 0,0
a,False
b,True
c,True
d,True


In [None]:
# another mask
(data < 0.8)

Unnamed: 0,0
a,True
b,True
c,True
d,False


In [None]:
# combine the masks
(data > 0.3) & (data < 0.8)

Unnamed: 0,0
a,False
b,True
c,True
d,False


In [None]:
# apply boolean mask on original data
data[(data > 0.3) & (data < 0.8)]


Unnamed: 0,0
b,0.5
c,0.75
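
Masks can also be combined with `|` (or) and inverted with `~` (not). The sketch below assumes the same demonstration Series; the variable names are illustrative. Each comparison needs its own parentheses because `&`, `|`, and `~` bind more tightly than `<` and `>`.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# "or": keep values below 0.3 or above 0.8
outside = data[(data < 0.3) | (data > 0.8)]

# "not": invert a mask to keep everything that is NOT below 0.3
not_small = data[~(data < 0.3)]

# Python's `and` / `or` keywords do not work element-wise on a Series;
# they raise a ValueError because the truth value of a Series is ambiguous.
try:
    data[(data > 0.3) and (data < 0.8)]
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(outside)
print(not_small)
```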


#### Fancy indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple, non-contiguous array elements at once.

In [None]:
# fancy indexing
print(data[['a', 'd']])

a    0.25
d    1.00
dtype: float64
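
Fancy indexing also works by position through `iloc`, and the result follows the order of the list that is passed in, including repeats. A small sketch with the same demonstration Series (variable names are illustrative):

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# By label: the order and repetition of the list are preserved.
picked_labels = data.loc[['d', 'a', 'a']]

# By position: the same idea with integer positions.
picked_positions = data.iloc[[3, 0]]

print(picked_labels)
print(picked_positions)
```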


## 2. The Pandas DataFrame Object

A DataFrame represents a tabular, spreadsheet-like data structure. The DataFrame has both a row and a column index.  You can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
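
To see what alignment means in practice, the sketch below builds a DataFrame from two Series whose indexes differ in order and in content. The third state is included only in `area` for illustration; pandas matches rows by label and fills the gap with `NaN`.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
# Different order, plus one label that `population` does not have.
area = pd.Series({'Texas': 695662, 'California': 423967, 'Alaska': 1723337})

aligned = pd.DataFrame({'population': population, 'area': area})
print(aligned)

# Rows are matched by label, not by position.
print(aligned.loc['Texas', 'area'])
```

Because of the missing value, the `population` column is stored as floating point rather than as integers.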

### 2.1 Creating via a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In [None]:
states_1 = pd.DataFrame(population, columns=['population'])
states_1

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


### 2.2 Creating via a dictionary of Series objects
A DataFrame can be constructed from a dictionary of Series objects as well:

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

Unnamed: 0,0
California,423967
Texas,695662
New York,141297
Florida,170312
Illinois,149995


In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

Unnamed: 0,0
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


In [None]:
states_2 = pd.DataFrame({'population': population,
                       'area': area})
states_2

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


### 2.3 DataFrame indexing and selection

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [None]:
states_2['area']

Unnamed: 0,area
California,423967
Texas,695662
New York,141297
Florida,170312
Illinois,149995


In [None]:
# adding a new column:
states_2['density'] = states_2['population'] / states_2['area']
states_2

Unnamed: 0,population,area,density
California,38332521,423967,90.413926
Texas,26448193,695662,38.01874
New York,19651127,141297,139.076746
Florida,19552860,170312,114.806121
Illinois,12882135,149995,85.883763


In [None]:
# slicing rows by index label (the end label is included)
states_2['California':'Florida']

Unnamed: 0,population,area,density
California,38332521,423967,90.413926
Texas,26448193,695662,38.01874
New York,19651127,141297,139.076746
Florida,19552860,170312,114.806121


In [None]:
# selecting multiple columns with a list of column names
states_2[['population', 'density']]

Unnamed: 0,population,density
California,38332521,90.413926
Texas,26448193,38.01874
New York,19651127,139.076746
Florida,19552860,114.806121
Illinois,12882135,85.883763
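
The plain `[]` operator on a DataFrame behaves differently depending on what it receives, which is easy to trip over. The following sketch summarizes the cases used above on a small three-state table; the name `states` and the threshold are illustrative.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127],
     'area': [423967, 695662, 141297]},
    index=['California', 'Texas', 'New York'])

one_column = states['area']                # single label -> a column (Series)
column_list = states[['area']]             # list of labels -> columns (DataFrame)
row_slice = states['California':'Texas']   # slice -> rows
row_mask = states[states['area'] > 400000] # boolean mask -> rows

# A single row label is NOT looked up among the rows by []:
try:
    states['California']
    row_label_failed = False
except KeyError:
    row_label_failed = True

print(type(one_column).__name__, type(column_list).__name__)
print(list(row_slice.index), list(row_mask.index))
```

To select a single row by label, use `states.loc['California']` instead.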


In [None]:
states_2.index

In [None]:
states_2.columns

## 3. Indexers: loc and iloc


<p>The loc attribute allows indexing and slicing that always references the explicit index. The iloc attribute allows indexing and slicing that always references the implicit Python-style index.</p>
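
The distinction matters most when the explicit index is itself made of integers. A minimal sketch, using a small made-up Series with the labels 1, 3, and 5:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]          # label 1    -> 'a'
by_position = s.iloc[1]      # position 1 -> 'b'

label_slice = s.loc[1:3]     # labels 1 through 3, end included
position_slice = s.iloc[1:3] # positions 1 and 2, end excluded

print(by_label, by_position)
print(list(label_slice), list(position_slice))
```

Writing `loc` or `iloc` explicitly makes the intent clear in code that would otherwise depend on how the index happens to be labeled.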

In [None]:
states_2

In [None]:
states_2.loc['California':'Florida', ['population', 'area']]

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312


In [None]:
states_2.iloc[:4, :2]

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312


Using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
states_2.loc[:'Florida', :'area']

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312


In [None]:
# the loc indexer can combine masking and fancy indexing, as in the following:

states_2.loc[states_2.density > 100, ['population', 'density']]

Unnamed: 0,population,density
New York,19651127,139.076746
Florida,19552860,114.806121
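
The same indexers can be used on the left-hand side of an assignment to modify values. The sketch below is an illustration only: the `dense` column and the threshold of 100 are assumptions made for the example, not part of the original notebook.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127],
     'area': [423967, 695662, 141297]},
    index=['California', 'Texas', 'New York'])
states['density'] = states['population'] / states['area']

# Start with a default value, then overwrite only the rows the mask selects.
states['dense'] = False
states.loc[states['density'] > 100, 'dense'] = True

print(states[['density', 'dense']])
```

Selecting the rows and the column in a single `loc` call, rather than chaining two `[]` operations, ensures the assignment is applied to the original DataFrame.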
