<div class="alert alert-block alert-info">

# 05_Lecture - Introduction to Pandas
</div>

In [None]:
import numpy as np
import pandas as pd
# import os
# print(os.getcwd())

<hr style="height:2px; border-width:0; color:pink; background-color:pink">

## Create a DataFrame
<font color=darkblue> A DataFrame is like a structured dataset, similar to a table in SQL or Excel. There are many ways to create a DataFrame. </font>

In [None]:
# Create a DataFrame from a list  
sample_list = ['USA','Canada', 'Brazil', 'Mexico', 'Argentina']  

# Creating a DataFrame
df = pd.DataFrame(sample_list)  
df

In [None]:
# Let's give the column a name
df.columns = ['Country']
df

<li><font color=darkblue> Even with only one column, the object above is still a DataFrame. Selecting a single column from it (for example, <font color=red>df['Country']</font>) returns a <font color=red>Series.</font> </font> </li>
<li><font color=darkblue> Many Series methods return another Series as output, which allows calling further methods in succession, known as <font color=red>Method Chaining. </font> </li>
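
A quick way to check both points is to inspect the types directly. This is a minimal sketch; the names `df_one`, `countries`, and `chained` are made up for illustration.

```python
import pandas as pd

# A one-column table built from a list is still a DataFrame (2-D)
df_one = pd.DataFrame(['USA', 'Canada', 'Brazil'], columns=['Country'])
print(type(df_one).__name__, df_one.shape)        # DataFrame (3, 1)

# Selecting that column yields a Series (1-D)
countries = df_one['Country']
print(type(countries).__name__, countries.shape)  # Series (3,)

# Method chaining: each call returns a new Series, so calls can be stacked
chained = countries.str.upper().str.slice(0, 3).sort_values()
print(chained.tolist())                           # ['BRA', 'CAN', 'USA']
```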

In [None]:
# Create a DataFrame using a dictionary

df = pd.DataFrame({"Country": ['USA','Canada', 'Brazil', 'Mexico', 'Argentina'],
                   "Capital": ['Washington DC', 'Ottawa', 'Brasilia', 'Mexico City', 'Buenos Aires']})
df

<hr style="height:2px; border-width:0; color:pink; background-color:pink">

## Basic Indexing Operations

In [None]:
# You can change the index
df.index = ['GDP_Rank_1', 'GDP_Rank_2', 'GDP_Rank_3', 'GDP_Rank_4', 'GDP_Rank_5']
df

In [None]:
# Retrieve the top & bottom n rows. n defaults to 5
df.head(3)

In [None]:
df.tail(3)

In [None]:
# You can reset the index
df.reset_index() # this moves the old index into a new column called 'index'
# But it returns a new DataFrame and does not alter df itself

In [None]:
# To apply the reset to df itself, pass inplace=True (or assign the result back to df)
df.reset_index(inplace=True, drop=True) # without drop=True it would have created the 'index' column
df
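
The difference between returning a new DataFrame and changing the existing one can be checked on a small throwaway frame. This is a sketch only; `df_demo`, `moved`, and `dropped` are hypothetical names, and assigning the result back is shown as an alternative to `inplace=True`.

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['USA', 'Canada']},
                       index=['GDP_Rank_1', 'GDP_Rank_2'])

# reset_index() returns a new DataFrame; the old index becomes a column named 'index'
moved = df_demo.reset_index()
print(moved.columns.tolist())    # ['index', 'Country']

# drop=True discards the old index instead of keeping it as a column
dropped = df_demo.reset_index(drop=True)
print(dropped.index.tolist())    # [0, 1]

# df_demo itself is unchanged until the result is assigned back
print(df_demo.index.tolist())    # ['GDP_Rank_1', 'GDP_Rank_2']
df_demo = df_demo.reset_index(drop=True)
print(df_demo.index.tolist())    # [0, 1]
```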

In [None]:
# You can add a column
df['GDP_Americas_Rank'] = [each+1 for each in range(0,len(df))]
df
# If the list length does not match the number of rows, pandas raises a ValueError
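
The length check can be seen on a small example. This is a sketch under the assumption that the new column is given as a plain list; `df_demo` and `error_message` are made-up names.

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['USA', 'Canada', 'Brazil']})

error_message = None
try:
    df_demo['Rank'] = [1, 2]          # 2 values for 3 rows
except ValueError as err:
    error_message = str(err)
print('ValueError:', error_message)

# A scalar is repeated for every row, so its length never mismatches
df_demo['Region'] = 'Americas'

# A list with exactly len(df_demo) values is accepted
df_demo['Rank'] = list(range(1, len(df_demo) + 1))
print(df_demo['Rank'].tolist())       # [1, 2, 3]
```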

<li><font color=darkblue> A row is identified by its name or position, and in Pandas parlance rows are <font color=red>axis=0</font> </font> </li>
<li><font color=darkblue> A column is identified by its name or position, and in Pandas parlance columns are <font color=red>axis=1</font> </font></li>
<li><font color=darkblue> When a method takes an axis parameter, the default is usually <font color=red> axis=0 </font></font></li>
<li><font color=darkblue>I personally use the mnemonic "There's an o in row, therefore a row is axis = 0."</font></li>
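
One way to see the two axes in action is sketched below. In `drop`, `axis` says where to look for the label; in a reduction such as `sum`, it says which direction gets collapsed. The frame `scores` is made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# drop: axis says where to look for the label
no_first_row = scores.drop(0, axis=0)      # removes the row labelled 0
no_b_column = scores.drop('b', axis=1)     # removes the column labelled 'b'
print(no_first_row.shape, no_b_column.shape)   # (2, 2) (3, 1)

# sum: axis says which direction is collapsed
print(scores.sum(axis=0).tolist())   # [6, 60]       one total per column
print(scores.sum(axis=1).tolist())   # [11, 22, 33]  one total per row
```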

In [None]:
# You can delete a column
df.drop('GDP_Americas_Rank',axis=1,inplace=True)
df
# Without axis=1, pandas would look for a row labelled 'GDP_Americas_Rank' and raise a KeyError
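
The error case, and one way around it, can be sketched on a throwaway frame. `df_demo`, `trimmed`, and `error_message` are hypothetical names; the `columns=` keyword of `drop` is used here as an alternative to `axis=1`.

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['USA', 'Canada'], 'Rank': [1, 2]})

error_message = None
try:
    df_demo.drop('Rank')      # axis defaults to 0, so pandas looks for a row labelled 'Rank'
except KeyError as err:
    error_message = str(err)
print('KeyError:', error_message)

# columns= states the intent directly and returns a new DataFrame
trimmed = df_demo.drop(columns='Rank')
print(trimmed.columns.tolist())    # ['Country']
print(df_demo.columns.tolist())    # ['Country', 'Rank'] (original untouched)
```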

In [None]:
# Let's add index and deleted column to use later
df.index = ['GDP_Rank_1', 'GDP_Rank_2', 'GDP_Rank_3', 'GDP_Rank_4', 'GDP_Rank_5']
df['GDP_Americas_Rank'] = [each+1 for each in range(0,len(df))]
df

<font color=darkblue> Two basic ways of selecting a column: </font>
<li><font color=darkblue> By passing the column name to the indexing operator (my preferred way)</font></li>
<li><font color=darkblue> By using the column name as an attribute</font></li>
<font color=darkblue> <font color=red>NOTE:</font> By selecting a single column, in effect we are creating a Series. </font>
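
One reason to prefer the indexing operator is that attribute access only works for some column names. The sketch below uses made-up columns (`'GDP Rank'` and `'count'`) to show two cases where it does not.

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['USA', 'Canada'],
                        'GDP Rank': [1, 2],
                        'count': [10, 20]})

# A plain identifier works both ways
print(df_demo.Country.equals(df_demo['Country']))   # True

# A name containing a space cannot be written as an attribute
print(df_demo['GDP Rank'].tolist())                 # [1, 2]

# A name that clashes with a DataFrame method resolves to the method
print(callable(df_demo.count))                      # True
print(df_demo['count'].tolist())                    # [10, 20]
```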

In [None]:
# By passing the column name to the indexing operator
df['Country']

In [None]:
type(df['Country'])

In [None]:
# By using the column name as an attribute
df.Country

In [None]:
type(df.Country)

<font color=darkblue> Return a DataFrame instead of a Series.</font>
<li><font color=darkblue> Column passed as an <font color = red> item of a list </font> returns a DataFrame.</font></li>
<li><font color=darkblue> Column passed as a string returns a Series.</font></li>
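
The difference shows up in the number of dimensions. A minimal sketch, with `df_demo`, `as_series`, and `as_frame` as made-up names:

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['USA', 'Canada'],
                        'Capital': ['Washington DC', 'Ottawa']})

as_series = df_demo['Country']      # string -> Series
as_frame = df_demo[['Country']]     # list with one string -> DataFrame

print(as_series.ndim, as_series.shape)   # 1 (2,)
print(as_frame.ndim, as_frame.shape)     # 2 (2, 1)

# A longer list selects several columns, in the order given
print(df_demo[['Capital', 'Country']].columns.tolist())   # ['Capital', 'Country']
```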

In [None]:
type(df[['Country']])

In [None]:
type(df['Country'])

<hr style="height:2px; border-width:0; color:pink; background-color:pink">

## Advanced Indexing

<font color=darkblue> A more powerful way of indexing is by using the attributes: </font>
<li><font color=darkblue> <font color=red>.loc </font> - select by index label (<font color=red>label-based indexing</font>)</font></li>
<li><font color=darkblue> <font color=red>.iloc </font> - select by integer position (<font color=red>position-based indexing</font>)</font></li>
<font color=darkblue> <font color=red>NOTE:</font> These are commonly called indexers or accessors: attributes used with square brackets to access a slice of the data. </font>
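
The distinction is easiest to see when the index labels are integers that do not match the positions. The frame `s_df` and its labels 10, 20, 30 are made up for this sketch.

```python
import pandas as pd

s_df = pd.DataFrame({'Country': ['USA', 'Canada', 'Brazil']}, index=[10, 20, 30])

print(s_df.loc[10, 'Country'])   # label 10    -> USA
print(s_df.iloc[0, 0])           # position 0  -> USA

errors = []
try:
    s_df.loc[0]                  # there is no label 0
except KeyError:
    errors.append('KeyError')
try:
    s_df.iloc[10]                # there is no position 10
except IndexError:
    errors.append('IndexError')
print(errors)                    # ['KeyError', 'IndexError']
```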

### <font color=darkblue>Access rows by index label (<font color=red>.loc</font>) </font>

In [None]:
# Pass the intended index label
df.loc["GDP_Rank_2"]
# Returns a Series when it is only one row

In [None]:
# To select multiple index labels, pass them as a list.
# Without the list, the second label is read as a column name and raises a KeyError
df.loc[["GDP_Rank_2","GDP_Rank_5"]]

<li><font color=darkblue> Accessing a sequence of index labels is just like in pure Python: <font color=red>df.loc[start:stop:step] </font></font></li>
<li><font color=darkblue><font color=red>NOTE: </font>Unlike in pure Python, in <font color=red>.loc</font> the stop label <u>is</u> included.</font></li>
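
The inclusive stop can be confirmed on a small frame that mirrors the one used in this lecture. This is a sketch; `df_demo`, `subset`, and `stepped` are made-up names.

```python
import pandas as pd

labels = ['GDP_Rank_1', 'GDP_Rank_2', 'GDP_Rank_3', 'GDP_Rank_4', 'GDP_Rank_5']
df_demo = pd.DataFrame({'Country': ['USA', 'Canada', 'Brazil', 'Mexico', 'Argentina']},
                       index=labels)

# Both the start and the stop label are part of the result
subset = df_demo.loc['GDP_Rank_2':'GDP_Rank_4']
print(subset.index.tolist())         # ['GDP_Rank_2', 'GDP_Rank_3', 'GDP_Rank_4']

# A step of 2 keeps every other row between the two labels
stepped = df_demo.loc['GDP_Rank_1':'GDP_Rank_5':2]
print(stepped['Country'].tolist())   # ['USA', 'Brazil', 'Argentina']
```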

In [None]:
# Fetch all rows between the start & stop indices
df.loc["GDP_Rank_2":"GDP_Rank_5"]

In [None]:
# Fetch alternate rows between the start & stop indices
df.loc["GDP_Rank_2":"GDP_Rank_5":2]

In [None]:
# Fetch everything from a specific index as a starting point
df.loc["GDP_Rank_3":]

In [None]:
# Fetch everything up to and including a specific index label
df.loc[:"GDP_Rank_3"]

### <font color=darkblue>Access rows by index position (<font color=red>.iloc</font>) </font>

In [None]:
# Pass the intended index position
df.iloc[1]
# Returns a Series when it is only one row

In [None]:
# To select multiple positions, pass them as a list.
# Without the list, df.iloc[1, 4] means row 1, column 4 and raises an IndexError here
df.iloc[[1,4]]

<li><font color=darkblue> Accessing a sequence of index positions is just like in pure Python: <font color=red>df.iloc[start:stop:step] </font></font></li>
<li><font color=darkblue><font color=red>NOTE:</font> The stop position is <u>not</u> included, unlike in .loc. <u>This is the same as in pure Python.</u></font></li>
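
Because `.iloc` follows Python's slicing rules, its results can be compared directly against slices of a plain list. A minimal sketch with made-up names (`countries`, `df_demo`):

```python
import pandas as pd

countries = ['USA', 'Canada', 'Brazil', 'Mexico', 'Argentina']
df_demo = pd.DataFrame({'Country': countries})

# .iloc picks the same positions as list slicing
print(df_demo.iloc[1:4]['Country'].tolist() == countries[1:4])    # True
print(df_demo.iloc[::2]['Country'].tolist() == countries[::2])    # True
print(df_demo.iloc[-2:]['Country'].tolist() == countries[-2:])    # True

# With the default 0..4 index, the same slice gives different row counts:
# .loc includes label 4, .iloc stops before position 4
print(len(df_demo.loc[1:4]), len(df_demo.iloc[1:4]))              # 4 3
```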

In [None]:
# Fetch all rows between the start & stop indices
df.iloc[1:4]

In [None]:
# Fetch alternate rows between the start & stop indices
df.iloc[0:5:2]

In [None]:
# Fetch everything from a specific index as a starting point
df.iloc[2:]

In [None]:
# Fetch everything up to (but not including) a specific position
df.iloc[:4]

In [None]:
# Fetch second from last
df.iloc[-2]

In [None]:
# Fetch everything, excluding the last two
df.iloc[:-2]

In [None]:
# Fetch only the last two, excluding everything else
df.iloc[-2:]

### <font color=darkblue>Access rows from specific columns (<font color=red>.loc</font> and <font color=red>.iloc</font>) </font>
<p></p>
<li> <font color=darkblue>Both <font color=red>.loc</font> and <font color=red>.iloc</font> accept a second argument to select columns</font></li>
<li> <font color=darkblue>When using <font color=red>.loc</font> we need to provide <font color=red>column name(s)</font></font></li>
<li> <font color=darkblue>When using <font color=red>.iloc</font> we need to provide <font color=red>column position(s)</font></font></li>
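
The two indexers can describe the same rectangle of data. In the sketch below, `df_demo` is a made-up frame, and `Index.get_loc` is used to translate a label into its position.

```python
import pandas as pd

labels = ['GDP_Rank_1', 'GDP_Rank_2', 'GDP_Rank_3']
df_demo = pd.DataFrame({'Country': ['USA', 'Canada', 'Brazil'],
                        'Capital': ['Washington DC', 'Ottawa', 'Brasilia']},
                       index=labels)

# Rows 2-3 and the 'Capital' column, first by label and then by position
by_label = df_demo.loc['GDP_Rank_2':'GDP_Rank_3', ['Capital']]
by_position = df_demo.iloc[1:3, [1]]
print(by_label.equals(by_position))             # True

# get_loc converts a label into the position that .iloc expects
col_pos = df_demo.columns.get_loc('Capital')    # 1
row_pos = df_demo.index.get_loc('GDP_Rank_2')   # 1
print(df_demo.iloc[row_pos, col_pos])           # Ottawa
```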

In [None]:
# Select entire DataFrame
df.loc[:,] # To the left of the comma is for rows. : indicates every row, to the right of the comma is for columns
# df.loc[:] # Also returns all columns by default

In [None]:
# Select entire DataFrame
df.iloc[:,] # To the left of the comma is for rows. : indicates every row, to the right of the comma is for columns
# df.iloc[:] # Also returns all columns by default

In [None]:
# Select second and third rows and only the second column
df.loc['GDP_Rank_2':'GDP_Rank_3','Capital']
# With a label slice, row selection is inclusive of the outer bound

In [None]:
# Select second and third rows and only the second column
df.iloc[1:3,1] # Did you spot the difference with .loc when it comes to row selection?
# Unlike in the case of .loc, row selection is exclusive of the outer bound.

In [None]:
# Select alternate rows and second and third columns
df.loc[['GDP_Rank_1','GDP_Rank_3','GDP_Rank_5'],['Capital','GDP_Americas_Rank']]

In [None]:
# Select alternate rows and second and third columns
df.iloc[0:5:2,[1,2]]
# df.iloc[0:5:2,[each for each in range(3)]] # get all columns

### <font color=darkblue>Access a single row x column value using <font color=red>.at</font> and <font color=red>.iat</font> </font>
<p></p>

<li> <font color=darkblue><font color=red>.at</font> takes row and column <u>labels</u></font></li>
<li> <font color=darkblue><font color=red>.iat</font> takes row and column <u>integer positions</u></font></li>
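
Both accessors read a single cell, and both can also write one. A minimal sketch on a made-up two-row frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['USA', 'Canada'],
                        'Capital': ['Washington DC', 'Ottawa']},
                       index=['GDP_Rank_1', 'GDP_Rank_2'])

# Same cell, addressed by labels and by positions
print(df_demo.at['GDP_Rank_2', 'Capital'])   # Ottawa
print(df_demo.iat[1, 1])                     # Ottawa

# Assigning through .at changes that one cell
df_demo.at['GDP_Rank_1', 'Capital'] = 'Washington, D.C.'
print(df_demo.iat[0, 1])                     # Washington, D.C.
```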

In [None]:
df.at["GDP_Rank_3","Capital"]

In [None]:
df.iat[2,1]

<div class="alert alert-block alert-success">

<b>Tip:</b> Take advantage of the ease of creating dataframes to quickly prototype any logic/solution
</div>

In [None]:
# Create a dataframe from random numbers
N = 1000
df = pd.DataFrame(
    {
        "Standard_Normal_Dist": np.random.randn(N),
        "Binomial_Dist": np.random.binomial(n=5, p=0.5, size=N)/10.0,
        "LogNormal_Dist": np.random.lognormal(mean=2, sigma=.75, size=N)/10.0,
    })

df

# np.random.randn(N) # Draw N samples from a standard normal distribution
# np.random.binomial(5, 0.5, size=N) # Number of successes in 5 coin flips (p=0.5 per flip), repeated N times
# np.random.lognormal(mean=2, sigma=.75, size=N) # Draw N samples from a lognormal distribution whose underlying normal has mean 2 & sd .75
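
When prototyping, it often helps if the random data is the same on every run. The sketch below assumes NumPy's `default_rng` generator is acceptable in place of the `np.random.*` functions above; the seed value 42 and the name `df_demo` are arbitrary choices.

```python
import numpy as np
import pandas as pd

N = 1000
rng = np.random.default_rng(seed=42)   # fixed seed: same numbers on every run

df_demo = pd.DataFrame(
    {
        "Standard_Normal_Dist": rng.standard_normal(N),
        "Binomial_Dist": rng.binomial(n=5, p=0.5, size=N) / 10.0,
        "LogNormal_Dist": rng.lognormal(mean=2, sigma=0.75, size=N) / 10.0,
    })

# Summary statistics are a quick sanity check on the generated columns
print(df_demo.describe().loc[["mean", "std", "min", "max"]])
```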

In [None]:
# # Recap
# .DataFrame()
# .columns
# .index
# .head()
# .tail()
# .reset_index(inplace=True, drop=True)
# .drop(axis=1,inplace=True)
# type()
# .loc[]
# .iloc[]
# .at[]
# .iat[]