# Basic Data Processing with Pandas

Pandas (short for “Python Data Analysis Library”) is a Python package built on top of NumPy. It is useful for handling labeled arrays, data manipulation, plotting, and web scraping.

Pandas is particularly strong in the area of handling spreadsheet structures, dealing with missing data, and processing time series data.



These are the new data types introduced by pandas (most used ones are in bold):

- **Series**: 1D labeled homogeneously-typed array.
- **DataFrame**: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.
- *Time Series*: Series with index containing datetimes.
- *Panel*: General 3D labeled, also size-mutable array (removed in pandas 0.25, so current versions no longer provide it).

Import the package, as follows:

In [None]:
import numpy as np
import pandas as pd

## 1. Series

- A series is a one-dimensional array-like object.   
- Each element has an associated data label, called its *index*. By default, the index consists of ordinary array indices, i.e. consecutive integers starting from zero.

### 1.1 Series Operations

In [None]:
series1 = pd.Series(['Alice', 'Bob', 'Charlie', 'Dahlia'])
series1

In [None]:
#add a name
series1 = pd.Series(['Alice', 'Bob', 'Charlie', 'Dahlia'], name ='name')
series1

In [None]:
series1.index  #this is the default index

- An entry can be retrieved using the index, as follows:

In [None]:
series1[0]

- Often it is more desirable to create a series with a custom index.
- Here the index is manually set to the labels 101 through 104.

In [None]:
series2 = pd.Series(['Alice', 'Bob', 'Charlie', 'Dahlia'], index=[101, 102, 103, 104])
series2

- Displaying the series shows each index label next to its value. Unlike a dictionary, a series also keeps the values in an array, so it supports both label lookup and array operations.

In [None]:
series2.index #custom index

In [None]:
series2[101]
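To make the difference between labels and positions concrete, here is a minimal sketch. The variable names are illustrative only. With an integer index like this one, plain brackets are treated as a label lookup, so `.loc` and `.iloc` make the intent explicit.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Charlie', 'Dahlia'],
                  index=[101, 102, 103, 104])

by_label = names.loc[101]     # look up by index label
by_position = names.iloc[0]   # look up by position (0 is the first entry)
print(by_label, by_position)

# 0 is not one of the labels, so a plain bracket lookup fails here
try:
    names[0]
    raised = False
except KeyError:
    raised = True
print("names[0] raised KeyError:", raised)
```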

- The attribute `values` returns all the values.

In [None]:
series2.values

In [None]:
series2.values[1]   # obj.values is simply an array 

### 1.2 Series and the Dictionary object
#### Dictionaries
Dictionaries are what they sound like: a collection of definitions that correspond to unique terms.

*Keys* and *values* are to a dictionary what words and their definitions are to an English dictionary. 

Each entry in a dictionary is called a key-value pair, and they are bound together by a colon `:` . 

To create a dictionary, use curly braces and declare each key-value pair separated by commas

In [None]:
student_id = {
    101: 'Alice', # a key-value pair
    102: 'Bob',
    103: 'Charlie',
    104: 'Dahlia',
}
student_id

Similarly to a list, you can use brackets to access the value, but this time with the key (instead of the index)

In [None]:
student_id[101]

The Series object is similar to a dictionary: `Series.index` plays the role of `dict.keys()`, and `Series.values` plays the role of `dict.values()`. A dictionary can be converted directly to a Series, as follows:

In [None]:
series3 = pd.Series(student_id)
series3 

Convert a Series back to a dictionary.

In [None]:
series3.to_dict()
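The dictionary analogy can be pushed a little further. The sketch below (variable names are illustrative) shows that membership tests, `keys()` and `items()` on a Series behave much like their dictionary counterparts, and that the round trip returns the original dictionary.

```python
import pandas as pd

student_id = {101: 'Alice', 102: 'Bob', 103: 'Charlie', 104: 'Dahlia'}
students = pd.Series(student_id)

has_102 = 102 in students        # membership is tested against the index
labels = list(students.keys())   # Series.keys() returns the index
pairs = list(students.items())   # (label, value) pairs, like dict.items()
round_trip = students.to_dict()  # back to a plain dictionary

print(has_102, labels)
print(pairs)
```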

## 2. DataFrame

- A data frame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns.
- Each column can be a different type (integers, strings, floating point numbers, Python objects, etc.).  

### 2.1 Creating DataFrames
A DataFrame may be created using a dict 

In [None]:
#declare dict object
data = {'commodity': ['Gold', 'Gold', 'Silver', 'Silver'],
        'year': [2016, 2017, 2018, 2019],
        'production_moz': [107.6, 109.7, 868.3, 886.7]} #world wide in million oz

# convert to DataFrame
df = pd.DataFrame(data)
df

Here's how it is done using nested lists:

In [None]:
df_2=pd.DataFrame([['Gold', 2016, 107.6],
                   ['Gold', 2017, 109.7],
                   ['Silver', 2018, 868.3],
                   ['Silver', 2019, 886.7]], 
                    columns=['commodity','year','production_moz'])
df_2
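As a quick check, the two constructions should produce the same table, with a data type inferred separately for each column. This is a small sketch with illustrative variable names.

```python
import pandas as pd

data = {'commodity': ['Gold', 'Gold', 'Silver', 'Silver'],
        'year': [2016, 2017, 2018, 2019],
        'production_moz': [107.6, 109.7, 868.3, 886.7]}
from_dict = pd.DataFrame(data)

from_rows = pd.DataFrame([['Gold', 2016, 107.6],
                          ['Gold', 2017, 109.7],
                          ['Silver', 2018, 868.3],
                          ['Silver', 2019, 886.7]],
                         columns=['commodity', 'year', 'production_moz'])

same = from_dict.equals(from_rows)  # True when values, labels and dtypes match
print(same)
print(from_dict.dtypes)             # one dtype per column
```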

The column names can be accessed by

In [None]:
df.columns 

It also has an index attribute

In [None]:
df.index #standard index

- The index may be set using the method `set_index`, as follows:

In [None]:
df=df.set_index('commodity')
df

- The dataframe can restore the original index using the method `reset_index`, as follows:

In [None]:
df = df.reset_index()
df

In [None]:
df.index #back to the default integer index after reset_index

- The names of columns may be changed as follows:

In [None]:
df.columns=['commodity', 'year','production']
df
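Assigning to `df.columns` requires listing every column name. When only some names need to change, the `rename` method with a mapping is a possible alternative. The sketch below uses a small frame with the same column names as above.

```python
import pandas as pd

df_demo = pd.DataFrame({'commodity': ['Gold', 'Silver'],
                        'year': [2016, 2018],
                        'production_moz': [107.6, 868.3]})

# rename returns a new DataFrame; only the mapped column changes
renamed = df_demo.rename(columns={'production_moz': 'production'})
print(list(renamed.columns))
print(list(df_demo.columns))  # the original frame keeps its names
```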

### 2.2. Reading a csv to a Dataframe
Let's read a dataframe from a comma-separated values (csv) file

In [None]:
df = pd.read_csv("zoo.csv")
df

Some initial methods to check the read file

In [None]:
len(df)   #how many rows?

In [None]:
df.head()      #check first 5 rows

In [None]:
df.tail()  #check last 5 rows

In [None]:
df.info() #shows number of non-empty entries per column

In [None]:
df.shape  #outputs (number of rows, number of columns); shape is an attribute, so no parentheses

### 2.3 Slicing a DataFrame

#### 2.3.1. By Column
Columns may be examined one-by-one using brackets. The result may be written as a series or a dataframe

In [None]:
df['animal'] #this yields a pandas series

Accessing a single column returns a Series because a dataframe is, in effect, a collection of Series objects that share the same index.
![Series + Series = DataFrame](https://raw.githubusercontent.com/dhrunlauwers/teaching-challenge/master/lesson-material/images/series-and-dataframe.png )

In [None]:
df[['animal']] #this yields a pandas data frame

In [None]:
df.animal       # retrieve by attribute

In [None]:
df[['animal','water_need']]   #get 2 columns
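The relationship between Series and DataFrame can be checked directly. The sketch below builds a small frame from two Series (the values are made up for illustration, not taken from `zoo.csv`) and inspects what single and double brackets return.

```python
import pandas as pd

animal = pd.Series(['elephant', 'tiger', 'zebra'], name='animal')
water_need = pd.Series([500, 300, 220], name='water_need')

# placing the Series side by side gives a DataFrame
zoo_demo = pd.concat([animal, water_need], axis=1)

single = zoo_demo['animal']     # single brackets: a Series
double = zoo_demo[['animal']]   # double brackets: a one-column DataFrame
print(type(single).__name__, type(double).__name__)
```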

#### 2.3.2 By Row

Rows may be selected by index label using the `.loc` indexer. Note that a label slice such as `:2` includes the end label.

In [None]:
df.loc[:2]

This is quite useful when using customized indices

In [None]:
df2 = df.set_index('id')
df2

In [None]:
df2.loc[1011]

In [None]:
df2.loc[[1011,1012,1013]]

To select rows by position rather than by label, which is often handy when the dataframe has a custom index, use the position-based indexer `.iloc` or slice with the simple bracket indexer `[]`

In [None]:
df2[5:10]

In [None]:
df2.iloc[5:10]
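One detail worth seeing side by side is that label slices with `.loc` include both endpoints, while position slices with `.iloc` exclude the stop position. A minimal sketch, using a made-up frame with an `id` index:

```python
import pandas as pd

zoo_demo = pd.DataFrame({'id': [1001, 1002, 1003, 1004],
                         'animal': ['elephant', 'tiger', 'zebra', 'lion'],
                         'water_need': [500, 300, 220, 450]}).set_index('id')

by_label = zoo_demo.loc[1001:1003]  # labels 1001, 1002 and 1003
by_position = zoo_demo.iloc[0:2]    # positions 0 and 1 only
print(len(by_label), len(by_position))
```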

#### 2.3.3 Custom Slicing

In [None]:
df['animal'][2]

In [None]:
df['water_need'][1:3]

In [None]:
df[['animal','water_need']][1:3]

In [None]:
# select as a matrix 
df.iloc[1:5]

In [None]:
# select as a matrix 
df.iloc[1:5,2]

### 2.4 Sorting a DataFrame
- It is possible to order the rows of data frames using `sort_values()`.  This method takes a column name as an argument and returns a new, sorted data frame.
- It is used on the zoo data frame to order the rows by water need, as follows:


In [None]:
df.sort_values('water_need')

By default the sort is done in ascending order.  To apply the sort in descending order, set the `ascending` parameter to `False`.

In [None]:
df.sort_values('water_need',ascending=False)
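`sort_values` also accepts a list of columns, with one `ascending` flag per column. The sketch below uses made-up values to sort by animal name first and then by water need, largest first.

```python
import pandas as pd

zoo_demo = pd.DataFrame({'animal': ['tiger', 'elephant', 'tiger', 'elephant'],
                         'water_need': [300, 500, 320, 550]})

# animal ascending, then water_need descending within each animal
ordered = zoo_demo.sort_values(['animal', 'water_need'],
                               ascending=[True, False])
print(ordered)
```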

### 2.5 Filtering a DataFrame
First, lets see what happens to a Series object under a conditional 

In [None]:
df['water_need']>=200

A DataFrame may be filtered by passing such a Series of booleans into the bracket operator

In [None]:
df[df['water_need']>200]

You may also combine the conditionals

In [None]:
df[(df['water_need']>200)&(df['animal']=='tiger')]
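Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. A small sketch on made-up data:

```python
import pandas as pd

zoo_demo = pd.DataFrame({'animal': ['elephant', 'tiger', 'tiger', 'zebra'],
                         'water_need': [500, 300, 150, 180]})

# rows that need more than 200 OR are tigers
either = zoo_demo[(zoo_demo['water_need'] > 200) | (zoo_demo['animal'] == 'tiger')]

# rows that are NOT tigers
not_tiger = zoo_demo[~(zoo_demo['animal'] == 'tiger')]
print(either)
print(not_tiger)
```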

`.between` is a convenient way to filter within a range

In [None]:
df[(df['water_need'].between(300,500))] #both bounds are included: equivalent to df[(df['water_need']>=300)&(df['water_need']<=500)]

`isin` is a convenient way of filtering given a lookup list

In [None]:
df[(df['water_need'].between(300,500))&(df['animal'].isin(['tiger','lion']))] #isin is equivalent to (df['animal']=='tiger')|(df['animal']=='lion')
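The boundary behavior of `between` is easy to verify on a handful of values. The sketch below assumes pandas 1.3 or later for the string form of the `inclusive` argument; the numbers are made up.

```python
import pandas as pd

water = pd.Series([300, 400, 500, 600])

inclusive = water.between(300, 500)         # both endpoints count by default
manual = (water >= 300) & (water <= 500)    # the equivalent comparison

# assumption: pandas >= 1.3, where inclusive accepts 'both', 'neither', 'left', 'right'
strict = water.between(300, 500, inclusive='neither')
print(inclusive.tolist(), strict.tolist())
```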

### 2.6 Removing and Adding entries in a Dataframe

#### 2.6.1 Removing methods

In [None]:
#this removes the row with index label 2 (returns a new DataFrame; df itself is unchanged)
df.loc[df.index != 2]

In [None]:
#this removes the column 'animal' (also returns a new DataFrame)
df.drop('animal', axis = 1)

In [None]:
#remove all rows with tiger
df.loc[df['animal'] != 'tiger']

- Rows and columns may also be removed using fancy indexing or `drop()`.
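For reference, here is how `drop()` can be used for both rows and columns. This is a sketch on made-up data; in each case a new DataFrame is returned and the original is left untouched.

```python
import pandas as pd

zoo_demo = pd.DataFrame({'animal': ['elephant', 'tiger', 'zebra'],
                         'water_need': [500, 300, 220]})

without_row = zoo_demo.drop(2)                 # drop the row labeled 2
without_col = zoo_demo.drop(columns='animal')  # drop a column by name
print(without_row)
print(without_col)
```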

#### 2.6.2 Adding methods

In [None]:
#this adds a row at the end of the dataset
df.loc[len(df)] = ['zookeeper',0,50,3]
df

In [None]:
#add a column with constant value
df['is_alive'] = True
df
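Two related patterns may be useful here: a column computed from an existing one, and `pd.concat` for appending several rows at once. The sketch below uses made-up data, and the column name `high_water_need` is purely illustrative.

```python
import pandas as pd

zoo_demo = pd.DataFrame({'animal': ['elephant', 'tiger'],
                         'water_need': [500, 300]})

# a column derived from another column (hypothetical name)
zoo_demo['high_water_need'] = zoo_demo['water_need'] > 400

# append several rows at once; ignore_index renumbers the rows 0..n-1
new_rows = pd.DataFrame({'animal': ['zebra', 'lion'],
                         'water_need': [220, 450],
                         'high_water_need': [False, True]})
combined = pd.concat([zoo_demo, new_rows], ignore_index=True)
print(combined)
```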

### 2.7 DataFrame Aggregations
We can take some basic stats on the dataframe

In [None]:
df['water_need'].mean()

In [None]:
df['water_need'].median()

In [None]:
df[df['animal']=='elephant']['water_need'].mean()
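Filtering one animal at a time works, but when the same statistic is wanted for every animal, `groupby` is one way to compute them all at once. This goes slightly beyond the steps above and is only a sketch on made-up data.

```python
import pandas as pd

zoo_demo = pd.DataFrame({'animal': ['elephant', 'elephant', 'tiger'],
                         'water_need': [500, 600, 300]})

# the filter-then-aggregate approach, for a single animal
elephant_mean = zoo_demo[zoo_demo['animal'] == 'elephant']['water_need'].mean()

# the same statistic for every animal in one step
mean_by_animal = zoo_demo.groupby('animal')['water_need'].mean()
print(mean_by_animal)
```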

In [None]:
#reset
df = pd.read_csv("zoo.csv")

### 2.8 Handling Missing Data

- Missing or, equivalently, corrupt data is an unavoidable reality in processing large data sets.  There are various ways of dealing with it, depending upon the circumstances:
  - Discard it, and all related data.
  - Interpolate values from surrounding data.
  - Isolate it and analyze it separately.

- Which approach to use is a scientific question.  Whatever approach is chosen, pandas has computational methods to carry it out.

- To figure out where the missing data is, use the `isnull()` method.

In [None]:
df.isnull()

- Summing up the boolean array reports how many missing values are in each column.

In [None]:
df.isnull().sum()  #sums each column; recent pandas versions warn when np.sum is applied to a DataFrame

- To isolate the rows in which there are null values, aggregate the `df.isnull()` boolean data frame along rows, using `any` with `axis=1`.

In [None]:
df.loc[df.isnull().any(axis=1)]

- To isolate only those rows in which all of the `_need` columns are null, use `all` instead, with `axis=1`.

In [None]:
df.loc[df[['water_need','meat_need']].isnull().all(axis=1)]

We can drop the rows with nulls

In [None]:
df.loc[~df.isnull().any(axis=1)]
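The same result can be obtained with the built-in `dropna()` method, which also accepts a `subset` of columns to check. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

zoo_demo = pd.DataFrame({'animal': ['elephant', 'tiger', 'zebra'],
                         'water_need': [500, np.nan, 220],
                         'meat_need': [0, 10, np.nan]})

via_mask = zoo_demo.loc[~zoo_demo.isnull().any(axis=1)]  # the approach shown above
complete = zoo_demo.dropna()                             # drop rows with any null
has_water = zoo_demo.dropna(subset=['water_need'])       # only check water_need
print(complete)
print(has_water)
```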


An alternative to discarding information is to **impute** the data. 
This can be done with the `fillna()` method, which takes the value to be imputed as its argument and returns a new object.

In [None]:
df['meat_need'].fillna(0)

Another common way to impute is by the mean of the column.

In [None]:
df['water_need'].fillna(df['water_need'].mean())
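Note that `fillna()` returns a new Series rather than changing the data frame, so the result has to be assigned back to keep it. The sketch below also shows `interpolate()`, which corresponds to the "interpolate from surrounding data" option listed earlier; it only makes sense when the row order is meaningful. The values are made up.

```python
import numpy as np
import pandas as pd

water = pd.Series([100.0, np.nan, 300.0, 500.0])

mean_filled = water.fillna(water.mean())  # the gap becomes the column mean (300.0)
interpolated = water.interpolate()        # the gap becomes the midpoint of its neighbors (200.0)
still_missing = int(water.isnull().sum()) # the original Series is unchanged

print(mean_filled.tolist())
print(interpolated.tolist())

# to keep the imputed values, assign the result back
water = mean_filled
```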

### Exercise

Read the csv 'employees.csv'

Print the data frame, looking for missing values by inspection.

In [1]:
### Your code here
import pandas as pd
df = pd.read_csv('employees.csv')
df

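| | Department | Education | Sex | Title | Year | Name | Salary |
|---|---|---|---|---|---|---|---|
| 0 | IT | Bachelor | M | analyst | 1.0 | Bob | 90.0 |
| 1 | IT | Master | M | analyst | 2.0 | Jake | 90.0 |
| 2 | HR | Master | M | analyst | 2.0 | John | 90.0 |
| 3 | HR | Bachelor | F | analyst | 2.0 | Judy | 90.0 |
| 4 | Trade | PHD | M | associate | 3.0 | Sam | 120.0 |
| 5 | ? | PHD | F | associate | 5.0 | Amy | 120.0 |
| 6 | Trade | Master | F | associate | NaN | Jennifer | 120.0 |
| 7 | HR | Master | M | VP | 8.0 | Peter | 262.5 |
| 8 | IT | ? | F | VP | 9.0 | Mary | 262.5 |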


Note the '?' entries in the data frame. For a small data frame like this, each '?' may be replaced by `np.nan` manually. For a large data frame, it is more efficient to use the `replace` method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html

Use `replace` to swap '?' with `np.nan`. 

In [11]:
#### Your code here
import numpy as np
df.replace('?',np.nan,inplace = True)
df.head()

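| | Department | Education | Sex | Title | Year | Name | Salary |
|---|---|---|---|---|---|---|---|
| 0 | IT | Bachelor | M | analyst | 1.0 | Bob | 90.0 |
| 1 | IT | Master | M | analyst | 2.0 | Jake | 90.0 |
| 2 | HR | Master | M | analyst | 2.0 | John | 90.0 |
| 3 | HR | Bachelor | F | analyst | 2.0 | Judy | 90.0 |
| 4 | Trade | PHD | M | associate | 3.0 | Sam | 120.0 |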
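Another option is to mark the placeholder as missing while reading the file, using the `na_values` parameter of `read_csv`. The sketch below reads from an in-memory string with made-up rows so that it runs without `employees.csv`; with the real file, the same argument would be passed alongside the file name.

```python
import io
import pandas as pd

# made-up sample in the same style as the exercise data
csv_text = """Department,Education,Year
IT,Bachelor,1
?,PHD,5
IT,?,9
"""

# every '?' is read as NaN straight away
employees_demo = pd.read_csv(io.StringIO(csv_text), na_values='?')
missing_per_column = employees_demo.isnull().sum()
print(missing_per_column)
```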


Print the rows with missing values.

In [13]:
#### Your code here
df.loc[df.isnull().any(axis=1)]

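| | Department | Education | Sex | Title | Year | Name | Salary |
|---|---|---|---|---|---|---|---|
| 5 | NaN | PHD | F | associate | 5.0 | Amy | 120.0 |
| 6 | Trade | Master | F | associate | NaN | Jennifer | 120.0 |
| 8 | IT | NaN | F | VP | 9.0 | Mary | 262.5 |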


Print the number of missing values in each column.

In [15]:
#### Your code here
df.isnull().sum()

Department    1
Education     1
Sex           0
Title         0
Year          1
Name          0
Salary        0
dtype: int64

Filter all employees listed with a Master's degree

In [18]:
#### Your code here
df_master = df[df['Education'] == "Master"]
df_master.head()

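| | Department | Education | Sex | Title | Year | Name | Salary |
|---|---|---|---|---|---|---|---|
| 1 | IT | Master | M | analyst | 2.0 | Jake | 90.0 |
| 2 | HR | Master | M | analyst | 2.0 | John | 90.0 |
| 6 | Trade | Master | F | associate | NaN | Jennifer | 120.0 |
| 7 | HR | Master | M | VP | 8.0 | Peter | 262.5 |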


Get the average salary of the IT Department

In [19]:
#### Your code here
df[df['Department']=="IT"]['Salary'].mean()

147.5

Get the average years of the HR Department

In [None]:
#### Your code here

In [20]:
df[df.Department == "HR"]['Year'].mean()

4.0

In [21]:
df[df.Department == "HR"].Year.mean()

4.0