# Pandas

**Pandas** is Python’s most widely used library for data analysis. It contains functions for accessing, aggregating, joining, and analyzing data.
Its core data structure, the DataFrame, is analogous to a SQL table or an Excel worksheet.


## Pandas Series
**A Pandas Series is the equivalent of a column of data.** In this section we will cover the basic properties of Series, how to create and manipulate them, and some useful functions.

Pandas Series are essentially NumPy arrays with additional features layered on top to make them easier to work with.

Topics covered:

* Pandas Series Basics
* Series Indexing
* Sorting and Filtering
* Operations and Aggregations
* Handling Missing Data
* Applying Custom Functions

### Pandas Series Basics

**Series** are Pandas data structures built on top of NumPy arrays.

* Series also contain an index and an optional name, in addition to the array of data
* They can be created from other data types but are usually imported from external sources
* Two or more Series grouped together form a Pandas DataFrame


Pandas Series have the following key properties:
* **values** - the data array in the Series
* **index** - the index array in the Series
* **name** - the optional name for the Series (useful for accessing columns in a DataFrame)
* **dtype** - the data type of the elements in the values array

In [1]:
import numpy as np
import pandas as pd

In [2]:
sales = [0,5,155,0,518,0,1827,616,317,325]

sales_series = pd.Series(sales,name='sales')

sales_series

0       0
1       5
2     155
3       0
4     518
5       0
6    1827
7     616
8     317
9     325
Name: sales, dtype: int64

In [3]:
sales_series.values

array([   0,    5,  155,    0,  518,    0, 1827,  616,  317,  325],
      dtype=int64)
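
The other properties can be inspected the same way. A minimal sketch, rebuilding the same `sales_series` so that the block runs on its own:

```python
import pandas as pd

sales = [0, 5, 155, 0, 518, 0, 1827, 616, 317, 325]
sales_series = pd.Series(sales, name='sales')

# The default index is a range of integer positions
print(sales_series.index)

# The name passed at creation time
print(sales_series.name)

# The data type of the elements in the values array
print(sales_series.dtype)
```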

### Pandas Data Types

**Numeric**:
* boolean - Nullable Boolean True/False
* Int64 - Nullable whole numbers
* Float64 - Nullable decimal numbers

**Text**:
* string - Only contains strings or text
* category - Maps categorical data to a numeric array for efficiency

**Time Series**:
* datetime64 - A single moment in time (January 4, 2015, 2:00:00 PM)
* timedelta - The duration between two dates or times
* period - A span of time (a day, a week, etc)

We can convert the data type of a Pandas Series with the `.astype()` method by specifying the desired data type, provided the values are compatible with it.

In [4]:
sales_series

0       0
1       5
2     155
3       0
4     518
5       0
6    1827
7     616
8     317
9     325
Name: sales, dtype: int64

In [5]:
sales_series.astype('float')

0       0.0
1       5.0
2     155.0
3       0.0
4     518.0
5       0.0
6    1827.0
7     616.0
8     317.0
9     325.0
Name: sales, dtype: float64
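
The nullable types listed above matter mostly when values are missing. The following sketch uses a small made-up list (not the sales data) to compare the default behaviour with the nullable `Int64` type:

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces integers to float64
default_series = pd.Series([1, None, 3])
print(default_series.dtype)

# The nullable 'Int64' dtype keeps whole numbers and stores the gap as <NA>
nullable_series = pd.Series([1, None, 3], dtype='Int64')
print(nullable_series.dtype)

# An existing float Series with NaN can be converted the same way
converted = default_series.astype('Int64')
print(converted)
```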

### Series Indexing

The index lets us easily access 'rows' in a Pandas Series or DataFrame and is one of the key distinguishing factors between an array and a Pandas Series.

Generally we should not assign categorical values to the index, but there are cases where a custom index is appropriate for accessing rows. This will become more relevant when working with datetimes.

In [6]:
sales2 = sales_series[:5]
sales2.index = ['Day 1','Day 2','Day 3','Day 4','Day 5']
sales2

Day 1      0
Day 2      5
Day 3    155
Day 4      0
Day 5    518
Name: sales, dtype: int64

In [7]:
sales2['Day 3']

155

In [8]:
sales2['Day 1':'Day 3']

Day 1      0
Day 2      5
Day 3    155
Name: sales, dtype: int64

In [9]:
sales2[::2]

Day 1      0
Day 3    155
Day 5    518
Name: sales, dtype: int64

### The ILOC Method

The **.iloc[]** accessor is the preferred way to access values by their positional index.
* It works even when Series have a custom, non-integer index
* It is explicit about using positions, so it avoids the label-versus-position ambiguity of plain `[]` indexing, and the pandas documentation recommends it for positional access

`df.iloc[row position,column position]`

In [10]:
sales_series.iloc[2]

155

In [11]:
sales_series.iloc[2:4]

2    155
3      0
Name: sales, dtype: int64

In [12]:
sales_series.iloc[-1]

325

In [13]:
sales_series.iloc[[0,1,2]]

0      0
1      5
2    155
Name: sales, dtype: int64

### The LOC Method

The **.loc[]** accessor is the preferred way to access values by their labels, and it is the one we will use most often.

It's more common because most of the time we want to access rows together with a subset of columns, and it is easier to reference columns by their labels. **Where `.iloc[]` references columns by position (0, 1 or 2), `.loc[]` references them by name ('price', 'customer', 'sales amount').**

`df.loc[row label,column label]`

In [14]:
sales2

Day 1      0
Day 2      5
Day 3    155
Day 4      0
Day 5    518
Name: sales, dtype: int64

In [15]:
#The .loc[] accessor selects values by their labels
sales2.loc['Day 5']

518

In [16]:
#The .iloc[] accessor selects values by their index position
sales2.iloc[-1]

518
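
One difference worth keeping in mind is how the two accessors treat the end of a slice. A small sketch, rebuilding `sales2` so that the block is self-contained:

```python
import pandas as pd

sales2 = pd.Series([0, 5, 155, 0, 518],
                   index=['Day 1', 'Day 2', 'Day 3', 'Day 4', 'Day 5'],
                   name='sales')

# Label slices include the end label
by_label = sales2.loc['Day 1':'Day 3']

# Positional slices exclude the end position, like Python lists
by_position = sales2.iloc[0:3]

# Both selections contain Day 1, Day 2 and Day 3
print(by_label.equals(by_position))
```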

### Duplicate Index Values & Resetting The Index

**It's possible to have duplicated index values** in a Pandas Series or DataFrame. If we keep the default index we don't have to worry about this, but a custom index with repeated values gives several rows the same label, which is not ideal. The best practice is to keep at least one unique row identifier, so that one lookup value returns exactly one row.

* Accessing a duplicated label with `.loc[]` returns all corresponding rows

We can always **reset the index** back to the default range of integers with the `.reset_index()` method. By default the index is reset and the previous index becomes a new column.

In [17]:
sales3 = sales_series[:5]
sales3.index = ['Day 1','Day 1','Day 3','Day 4','Day 5']
sales3

Day 1      0
Day 1      5
Day 3    155
Day 4      0
Day 5    518
Name: sales, dtype: int64

In [18]:
sales3.loc['Day 1']

Day 1    0
Day 1    5
Name: sales, dtype: int64

In [19]:
sales3.reset_index()
#If we don't want to save the previous index as a new column, we can drop it: sales3.reset_index(drop=True)

   index  sales
0  Day 1      0
1  Day 1      5
2  Day 3    155
3  Day 4      0
4  Day 5    518
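
Before relying on label lookups it can help to check whether the index is unique. A minimal sketch, rebuilding `sales3` (the name `sales3_reset` is only illustrative):

```python
import pandas as pd

sales3 = pd.Series([0, 5, 155, 0, 518],
                   index=['Day 1', 'Day 1', 'Day 3', 'Day 4', 'Day 5'],
                   name='sales')

# False when at least one label appears more than once
print(sales3.index.is_unique)

# Boolean array flagging the second and later occurrences of each label
print(sales3.index.duplicated())

# Discard the old labels and fall back to the default integer index
sales3_reset = sales3.reset_index(drop=True)
print(sales3_reset.index.is_unique)
```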


### Filtering Series

We can filter a Series by passing a logical test into the `.loc[]` accessor, just as we would with NumPy arrays.

In [25]:
sales3

Day 1      0
Day 1      5
Day 3    155
Day 4      0
Day 5    518
Name: sales, dtype: int64

In [28]:
sales3.loc[sales3 > 0]

Day 1      5
Day 3    155
Day 5    518
Name: sales, dtype: int64

In [32]:
sales3.loc[(sales3.index == 'Day 1') & (sales3 > 4)]

Day 1    5
Name: sales, dtype: int64

### Logical Operators & Methods

We can use the following operators and methods to create Boolean filters for logical tests. The Python operators are the more commonly used of the two:
![Capture.PNG](https://i.ibb.co/1q3djB4/Capture.png)

In [33]:
sales3 == 5

Day 1    False
Day 1     True
Day 3    False
Day 4    False
Day 5    False
Name: sales, dtype: bool

In [34]:
sales3.eq(5)

Day 1    False
Day 1     True
Day 3    False
Day 4    False
Day 5    False
Name: sales, dtype: bool

In [54]:
(sales3.index.isin(['Day 1'])) & (sales3 == 5)

Day 1    False
Day 1     True
Day 3    False
Day 4    False
Day 5    False
Name: sales, dtype: bool

In [41]:
~sales3.index.isin(['Day 1'])

array([False, False,  True,  True,  True])
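
As a further illustration, the sketch below pairs two comparison operators with their method equivalents and combines conditions with `|` (or) and `~` (not). The variable names `zero_or_large` and `not_day_1` are made up for this example:

```python
import pandas as pd

sales3 = pd.Series([0, 5, 155, 0, 518],
                   index=['Day 1', 'Day 1', 'Day 3', 'Day 4', 'Day 5'],
                   name='sales')

# Method form of the comparison operators gives the same Boolean Series
print(sales3.gt(4).equals(sales3 > 4))
print(sales3.le(5).equals(sales3 <= 5))

# | is element-wise OR; each condition needs its own parentheses
zero_or_large = sales3.loc[(sales3 == 0) | (sales3 > 500)]
print(zero_or_large)

# ~ negates a Boolean filter
not_day_1 = sales3.loc[~sales3.index.isin(['Day 1'])]
print(not_day_1)
```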

### Sorting Series

We can sort Series by their values or by their index. By default the `.sort_values()` method sorts a Series by its values in ascending order, and `.sort_index()` does the same for the index.

In [47]:
sales_series

0       0
1       5
2     155
3       0
4     518
5       0
6    1827
7     616
8     317
9     325
Name: sales, dtype: int64

In [48]:
sales_series.sort_values()

0       0
3       0
5       0
1       5
2     155
8     317
9     325
4     518
7     616
6    1827
Name: sales, dtype: int64

In [50]:
sales_series.sort_index(ascending=False)

9     325
8     317
7     616
6    1827
5       0
4     518
3       0
2     155
1       5
0       0
Name: sales, dtype: int64
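
Both sorting methods accept `ascending=False` and return a new Series rather than changing the original. A short sketch (the name `top_sales` is only illustrative):

```python
import pandas as pd

sales_series = pd.Series([0, 5, 155, 0, 518, 0, 1827, 616, 317, 325], name='sales')

# Largest values first; the original index labels travel with their values
top_sales = sales_series.sort_values(ascending=False)
print(top_sales.head(3))

# Sorting returns a new Series; the original order is unchanged
print(sales_series.iloc[0])
```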

### Arithmetic Operators and Methods

There are several operators and methods that we can use to perform numeric operations on Series. In most cases it's more common to stick with the Python operators, but there are a couple of use cases where the Pandas methods can be very handy:
![Capture.PNG](https://i.ibb.co/pQwJyZ5/Capture.png)

In [61]:
sales_series = sales_series.loc[:4]
sales_series

0      0
1      5
2    155
3      0
4    518
Name: sales, dtype: int64

In [62]:
sales_series + 2

0      2
1      7
2    157
3      2
4    520
Name: sales, dtype: int64

In [68]:
sales_series.astype('float').astype('string') + ' €'

0      0.0 €
1      5.0 €
2    155.0 €
3      0.0 €
4    518.0 €
Name: sales, dtype: string

In [69]:
#Or we can use the pandas addition method
sales_series.add(2)

0      2
1      7
2    157
3      2
4    520
Name: sales, dtype: int64

In [73]:
#The pandas addition method is very handy when we have missing data
sales_series[1] = np.nan
sales_series

0      0.0
1      NaN
2    155.0
3      0.0
4    518.0
Name: sales, dtype: float64

In [74]:
sales_series.add(2)

0      2.0
1      NaN
2    157.0
3      2.0
4    520.0
Name: sales, dtype: float64

In [81]:
#We can use fill_value to replace missing values with zero before performing the addition
sales_series.add(2,fill_value=0)

0      2.0
1      2.0
2    157.0
3      2.0
4    520.0
Name: sales, dtype: float64
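
`fill_value` is also useful when adding two Series whose indexes do not fully match. The sketch below uses two hypothetical Series, `store_a` and `store_b`, that are not part of the sales data above:

```python
import numpy as np
import pandas as pd

store_a = pd.Series([10, np.nan, 30], index=['Mon', 'Tue', 'Wed'])
store_b = pd.Series([1, 2, 3], index=['Mon', 'Tue', 'Thu'])

# Plain + propagates NaN and yields NaN for labels present in only one Series
print(store_a + store_b)

# fill_value=0 treats a gap as 0 when the other side has a value
total = store_a.add(store_b, fill_value=0)
print(total)
```

If a label were missing on both sides, the result for that label would still be missing.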

### String Methods

The Pandas `.str` accessor lets us access many string methods such as:
![Capture.PNG](https://i.ibb.co/b7VzY4G/Capture.png)

In [86]:
tech_stuff = pd.Series(['TV','Phone','TV','Computer','Watch'])
tech_stuff

0          TV
1       Phone
2          TV
3    Computer
4       Watch
dtype: object

In [87]:
tech_stuff.str.upper()

0          TV
1       PHONE
2          TV
3    COMPUTER
4       WATCH
dtype: object

In [94]:
tech_stuff.loc[tech_stuff.str.contains('TV')]

0    TV
2    TV
dtype: object

In [95]:
tech_prices = pd.Series(['TV 999','Phone 350','TV 876','Computer 779','Watch 159'])
tech_prices

0          TV 999
1       Phone 350
2          TV 876
3    Computer 779
4       Watch 159
dtype: object

In [98]:
#If we have useful data combined into a single string that we want to break up into columns the split method will achieve that
tech_prices.str.split(' ',expand=True)

          0    1
0        TV  999
1     Phone  350
2        TV  876
3  Computer  779
4     Watch  159
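
The pieces returned by `.str.split()` are still text. A possible next step, sketched here with the illustrative names `parts` and `prices`, is to convert the price column to a numeric type before doing any math:

```python
import pandas as pd

tech_prices = pd.Series(['TV 999', 'Phone 350', 'TV 876', 'Computer 779', 'Watch 159'])

# expand=True returns a DataFrame with one column per piece
parts = tech_prices.str.split(' ', expand=True)

# Convert the price column from text to integers
prices = parts[1].astype('int')
print(prices.sum())
```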


### Numeric Series Aggregation

We can perform many different types of aggregations on our numeric Series.
![Capture.PNG](https://i.ibb.co/FYvDs63/Capture.png)

In [103]:
sales_series

0      0.0
1      2.0
2    155.0
3      0.0
4    518.0
Name: sales, dtype: float64

In [104]:
sales_series.count()

5

In [105]:
sales_series.sum()

675.0

In [106]:
sales3

Day 1      0
Day 1      5
Day 3    155
Day 4      0
Day 5    518
Name: sales, dtype: int64

In [108]:
sales3.loc['Day 1'].sum()

5
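
A few more aggregations, sketched on the same five values shown above (the Series is rebuilt here so that the block runs on its own):

```python
import pandas as pd

sales_series = pd.Series([0.0, 2.0, 155.0, 0.0, 518.0], name='sales')

print(sales_series.mean())    # arithmetic average
print(sales_series.median())  # middle value once sorted
print(sales_series.min(), sales_series.max())

# describe() bundles count, mean, std, min, quartiles and max into one Series
print(sales_series.describe())
```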

### Categorical Series Aggregation

The following methods tend to work best on text or categorical fields whose values are repeated throughout a Series, but we can call them on numeric Series as well.
![Capture.PNG](https://i.ibb.co/TBHpW68/Capture.png)

In [109]:
tech_stuff

0          TV
1       Phone
2          TV
3    Computer
4       Watch
dtype: object

In [111]:
tech_stuff.value_counts()

TV          2
Phone       1
Computer    1
Watch       1
dtype: int64

In [112]:
#Share of each value (normalize=True returns proportions instead of counts)
tech_stuff.value_counts(normalize=True)

TV          0.4
Phone       0.2
Computer    0.2
Watch       0.2
dtype: float64

In [113]:
tech_stuff.unique()

array(['TV', 'Phone', 'Computer', 'Watch'], dtype=object)

In [116]:
print('Unique items in the series:',tech_stuff.nunique())

Unique items in the series: 4


### Missing Data

Missing data in Pandas is often represented by NumPy `NaN` values
* This is more efficient than using Python's `None` object
* Pandas treats NaN values as floats, which allows them to be used in vectorized operations

`np.nan` creates a NaN value. These are rarely created by hand, however, and typically appear when reading in data from external sources.

In [181]:
sales = [5,10,np.nan,25,30,np.nan,15]
sales_series = pd.Series(sales,name='nan_sales')
sales_series

0     5.0
1    10.0
2     NaN
3    25.0
4    30.0
5     NaN
6    15.0
Name: nan_sales, dtype: float64

### Identifying Missing Data

The `.isna()` and `.value_counts()` methods allow us to identify missing data in a Series

* The `.isna()` method returns True if a value is missing and False otherwise
* The `.value_counts()` method returns unique values and their frequency, but it suppresses null values by default, so we need to specify `dropna=False` to get a count of missing values

In [131]:
sales_series.isna()

0    False
1    False
2     True
3    False
4    False
5     True
6    False
Name: nan_sales, dtype: bool

In [132]:
#To count the null values we sum the result of isna(), because True and False count as 1 and 0
sales_series.isna().sum()

2

In [133]:
sales_series.value_counts(dropna=False)

NaN     2
15.0    1
30.0    1
25.0    1
10.0    1
5.0     1
Name: nan_sales, dtype: int64
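
Two related checks that may be useful are the share of missing values and where they sit. A minimal sketch on the same data (the name `missing_share` is only illustrative):

```python
import numpy as np
import pandas as pd

sales_series = pd.Series([5, 10, np.nan, 25, 30, np.nan, 15], name='nan_sales')

# .notna() is the inverse of .isna()
print(sales_series.notna().sum())

# The mean of a Boolean Series is the share of True values
missing_share = sales_series.isna().mean()
print(round(missing_share, 3))

# Filter on the Boolean Series to see which index labels hold the missing values
print(sales_series.loc[sales_series.isna()].index.tolist())
```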

### Fixing Missing Data

We can either drop missing values or fill them. The `.dropna()` and `.fillna()` methods allow us to handle missing data in a Series

* The `.dropna()` method removes NaN values from the Series or DataFrame
* The `.fillna(value)` method replaces NaN values with a specified value


In [177]:
sales_series

0     5.0
1    10.0
2     NaN
3    25.0
4    30.0
5     NaN
6    15.0
Name: nan_sales, dtype: float64

In [182]:
sales_series.dropna().reset_index(drop=True)

0     5.0
1    10.0
2    25.0
3    30.0
4    15.0
Name: nan_sales, dtype: float64

After `.dropna()` the index has gaps where rows were removed, which is why the cell above chains `.reset_index(drop=True)` to get back to a continuous range of integers.

In [184]:
sales_series.fillna(0)

0     5.0
1    10.0
2     0.0
3    25.0
4    30.0
5     0.0
6    15.0
Name: nan_sales, dtype: float64

It's important to **be thoughtful and deliberate** in how we handle missing data. We can remove it, we can replace it with zero, or we can impute it with a value that makes sense.

For example, we can impute the missing values with the mean. This keeps the mean of the Series unchanged while giving us values to work with, although it does reduce the spread of the data.

However, these operations can dramatically impact the results of an analysis, so we should understand the impacts before acting upon missing data and talk to a data SME (Subject Matter Expert) to understand why data is missing.

In [143]:
sales_series.fillna(sales_series.mean())

0     5.0
1    10.0
2    17.0
3    25.0
4    30.0
5    17.0
6    15.0
Name: nan_sales, dtype: float64
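
As a rough check of what mean imputation does to summary statistics, the sketch below compares the mean and standard deviation before and after filling (the name `imputed` is only illustrative):

```python
import numpy as np
import pandas as pd

sales_series = pd.Series([5, 10, np.nan, 25, 30, np.nan, 15], name='nan_sales')
imputed = sales_series.fillna(sales_series.mean())

# Aggregations skip NaN by default, so the mean is the same before and after
print(sales_series.mean(), imputed.mean())

# The spread shrinks, because the filled values sit exactly on the mean
print(sales_series.std(), imputed.std())

# The median is an alternative fill value that is less sensitive to outliers
print(sales_series.fillna(sales_series.median()))
```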

### Applying Custom Functions to Series


The `.apply()` method lets us apply custom functions to Pandas Series
* The function is called once per element rather than being vectorized, so it won't be as efficient as the native functions that already exist in NumPy and Pandas

In [155]:
# This function applies a 10% discount to prices over 20
def discount(price):
    
    if price > 20:
        return round((price * 0.9),2)
    
    return price

In [156]:
sales_series.fillna(8,inplace=True)
sales_series

0     5.0
1    10.0
2     8.0
3    25.0
4    30.0
5     8.0
6    15.0
Name: nan_sales, dtype: float64

In [157]:
sales_series.apply(discount)

0     5.0
1    10.0
2     8.0
3    22.5
4    27.0
5     8.0
6    15.0
Name: nan_sales, dtype: float64

In [172]:
#We can also use lambda functions for one-off tasks
sales_series.apply(lambda x: round(x*0.9,2) if x > 20 else x)

0     5.0
1    10.0
2     8.0
3    22.5
4    27.0
5     8.0
6    15.0
Name: nan_sales, dtype: float64
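
Since `.apply()` is not vectorized, the same discount can also be written with a Boolean mask. This is one possible vectorized sketch, with `discounted` and `over_20` as illustrative names:

```python
import pandas as pd

sales_series = pd.Series([5.0, 10.0, 8.0, 25.0, 30.0, 8.0, 15.0], name='nan_sales')

# Work on a copy so the original prices stay untouched
discounted = sales_series.copy()

# Boolean mask selecting the prices that qualify for the discount
over_20 = sales_series > 20

# Apply the 10% discount to the selected rows in one vectorized step
discounted.loc[over_20] = (sales_series.loc[over_20] * 0.9).round(2)
print(discounted)
```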

### The Where Method

The Pandas `.where()` method returns Series values based on a logical condition. It is similar to NumPy's `where()` function but more limited.

`df.where(logical test
          ,value if False
          ,inplace = False)`
        
We can only specify the replacement for values where the test is False; values where the test is True are kept as they are. The operation can also be done in place, but by default it is not.

It's also possible to chain `.where()` to combine logical expressions.

In [173]:
sales_series

0     5.0
1    10.0
2     8.0
3    25.0
4    30.0
5     8.0
6    15.0
Name: nan_sales, dtype: float64

In [175]:
#The result is a new Series; with inplace=False, sales_series itself is not changed
sales_series.where(sales_series <= 20
                  ,round(sales_series * 0.9,2)
                  ,inplace=False
                  )

0     5.0
1    10.0
2     8.0
3    22.5
4    27.0
5     8.0
6    15.0
Name: nan_sales, dtype: float64

Because the Pandas `.where()` method only lets us set a value for the cases where the test is False, we have to write the test from the point of view of the values we want to keep, which can feel backwards. The `~` operator inverts the truth values of a test, so we can write the condition in its more intuitive direction and then negate it.

In [189]:
sales_series.where(~(sales_series > 20)
                  ,round(sales_series * 0.9,2)
                   ,inplace=False
                  )

0     5.0
1    10.0
2     8.0
3    22.5
4    27.0
5     8.0
6    15.0
Name: nan_sales, dtype: float64
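
Pandas also provides the `.mask()` method, which replaces values where the condition is True. It is not used elsewhere in this section, but it may read more naturally than negating the test. A minimal sketch:

```python
import pandas as pd

sales_series = pd.Series([5.0, 10.0, 8.0, 25.0, 30.0, 8.0, 15.0], name='nan_sales')

# .mask() replaces values where the condition is True,
# so the test can be written in its natural direction
discounted = sales_series.mask(sales_series > 20, (sales_series * 0.9).round(2))
print(discounted)
```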

Since the NumPy `np.where()` function is more complete, we can always use it instead, keeping in mind that it returns a NumPy array that we need to convert back into a Pandas Series.

In [192]:
pd.Series(np.where(sales_series > 20
                   ,round(sales_series * 0.9,2)
                   ,sales_series)
         )

0     5.0
1    10.0
2     8.0
3    22.5
4    27.0
5     8.0
6    15.0
dtype: float64
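
Note that the rebuilt Series above has lost its name, and with a custom index the labels would be lost as well. One way to keep them, sketched with a small hypothetical `prices` Series, is to pass the original index and name back in:

```python
import numpy as np
import pandas as pd

prices = pd.Series([5.0, 25.0, 30.0], index=['a', 'b', 'c'], name='price')

# np.where returns a plain array, so the labels and the name are lost
raw = np.where(prices > 20, (prices * 0.9).round(2), prices)

# Pass the original index and name back in when rebuilding the Series
discounted = pd.Series(raw, index=prices.index, name=prices.name)
print(discounted)
```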

### Key Takeaways

**1 - Pandas Series add an index and a name to NumPy arrays**
* Pandas Series form the columns for DataFrames

**2 - The `.loc[]` and `.iloc[]` accessors are key in working with Pandas data structures**
* They allow us to access rows in Series, either by their labels (`.loc[]`) or by their positional index (`.iloc[]`)

**3 - Pandas and NumPy have similar operations for filtering, sorting and aggregating**
* We should use built-in Pandas and NumPy functions to take advantage of vectorization, which is much more efficient than writing for loops in base Python

**4 - Pandas lets us easily handle missing data**
* It's important to understand the impact that dropping or imputing values might have on an analysis, so we should always make sure to consult an expert about the root cause of missing data