# The Fundamentals of Pandas
Pandas is a data analysis library for Python that enables powerful and easy ingress, manipulation, and storage of data. This notebook covers some of the basics of using the Pandas library. For more extensive information, please visit the official documentation [here](https://pandas.pydata.org/).\
\
The notebook is broken into six sections for mastering the basics of Pandas:
### Table of Contents
0. [Pandas Setup](#0)
1. [Pandas Data Structures](#1)
    * [Series](#1.1)
    * [DataFrames](#1.2)
2. [Creating DataFrames](#2)
    * [A list of dictionaries](#2.1)
    * [A dictionary of lists](#2.2)
    * [Reading from a SQL database](#2.3)
    * [Web Scraping a table](#2.4)
    * [Reading from a CSV](#2.5)
3. [Reading data from a DataFrame](#3)
    * [Keys/Indexing](#3.1)
    * [Using iloc[ ]](#3.2)
    * [Using loc[ ]](#3.3)
    * [Conditional views](#3.4)
4. [Data Manipulation in Pandas](#4)
    * [Drop](#4.1)
    * [DropNA](#4.2)
    * [Duplicated](#4.2.5)
    * [Drop Duplicates](#4.3)
    * [At and Iat](#4.4)
    * [Append](#4.5)
    * [Join](#4.6)
    * [GroupBy](#4.7)
    * [Binning Data](#4.8)
5. [Views vs. Copies](#5)

<a id="0"></a>
## 0) Pandas Setup
To install Pandas, simply open a terminal and run ```pip install pandas```, or run the following cell to accomplish the same:

In [1]:
!pip install pandas





Next, we can import pandas for use in our script or notebook.

In [2]:
# Importing Pandas as a dependency. We alias the library to "pd" using the "as" keyword to make it shorter to write in our code.
import pandas as pd

<a id="1"></a>

## 1) Pandas Data Structures
Pandas features two major data structures: Series and DataFrames.

<a id="1.1"></a>
### Series
Series objects are indexed, one-dimensional arrays that behave similarly to native Python lists or dictionaries, while also featuring several methods native to pandas.

In [3]:
# A native python list
my_data = ["a","b","c","d","e","f","h","i","j"]

# Conversion to Pandas Series object
my_series = pd.Series(my_data)

# OPTIONAL: Naming our series
my_series.name = "My Letters"

# Printing out our Series
my_series

0    a
1    b
2    c
3    d
4    e
5    f
6    h
7    i
8    j
Name: My Letters, dtype: object

On the left, the Series index is visible, starting from 0. To the right is the data we created in our list. At the bottom, we can see the optional name we added to the Series, as well as the datatype of the data within the Series.

While this may not seem terribly impressive compared to a normal Python list, it allows us to use special pandas methods with our data. Below, the ```describe()``` method is used to return summary statistics for a set of data.

In [4]:
my_data_2 = [23,52,62,25,24,22,21,28,32]
my_series_2 = pd.Series(my_data_2)
my_series_2

0    23
1    52
2    62
3    25
4    24
5    22
6    21
7    28
8    32
dtype: int64

In [5]:
my_series_2.describe()

count     9.000000
mean     32.111111
std      14.709219
min      21.000000
25%      23.000000
50%      25.000000
75%      32.000000
max      62.000000
dtype: float64

Note that the datatype for the "describe" result is different. That is because the result is a Series of its own!

In [6]:
type(my_series_2.describe())

pandas.core.series.Series

You will also notice that the index for a Series does not always have to be numerical like a Python list, but can also be string-based like a dictionary. Positions in the Series can thus be accessed using the index, like a list or dictionary.

In [7]:
print(f"Accessing an item like a list: {my_series_2[0]}.")
print(f"Accessing an item like a dict: {my_series_2.describe()['count']}.")

Accessing an item like a list: 23.
Accessing an item like a dict: 9.0.
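To make the dictionary-like behavior more concrete, here is a minimal sketch of a Series built with string labels. The names and values are made up for illustration; only `pd.Series` and `iloc` come from pandas itself.

```python
import pandas as pd

# Hypothetical data: a Series with string labels instead of the default 0..n-1 index
ages = pd.Series([31, 45, 27], index=["ann", "bob", "cat"], name="Age")

# Label-based access, like a dictionary key
by_label = ages["bob"]

# Position-based access, like a list index (iloc is explicit about using positions)
by_position = ages.iloc[0]

print(by_label, by_position)
```

When the labels are strings, reaching for ```iloc[ ]``` whenever a position is meant keeps the intent unambiguous.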


<a id="1.2"></a>
### DataFrames
The second major Pandas datatype is the DataFrame. A DataFrame is a tabular (table-like), 2-dimensional (i.e., rows and columns) object that is in many ways the central part of the pandas library. DataFrames are both indexed, like Series, and labeled. The index corresponds to the rows of the DataFrame, while the labels correspond to the columns.

In [8]:
# Initializing a DataFrame using the first Series we created.
df = pd.DataFrame(my_series)

# Adding the second Series to the DataFrame
df["My Numbers"] = my_series_2

# Viewing the DataFrame
df

Unnamed: 0,My Letters,My Numbers
0,a,23
1,b,52
2,c,62
3,d,25
4,e,24
5,f,22
6,h,21
7,i,28
8,j,32


The DataFrame features the same index as the Series used to create its columns. Within the DataFrame, every row and every column can be pulled out as its own Pandas Series. In other words, a DataFrame can be thought of as a matrix of intersecting Series!\
![series_matrix.png](images/series_matrix.png)
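The idea that rows and columns are each a Series can be checked directly. The sketch below uses a small made-up DataFrame (`demo` is a hypothetical name) shaped like the one above.

```python
import pandas as pd

# Hypothetical two-column DataFrame, similar to the one built above
demo = pd.DataFrame({"My Letters": ["a", "b", "c"], "My Numbers": [23, 52, 62]})

column = demo["My Numbers"]   # one column
row = demo.loc[1]             # one row, selected by its index label

print(type(column).__name__)
print(type(row).__name__)
# For a row, the Series index is made of the DataFrame's column labels
print(row.name, list(row.index))
```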

<a id="2"></a>

## 2) Creating DataFrames
Pandas has numerous convenient, built-in ways to create DataFrames, depending on the structure of our source data. The following are just a few of the most common:

<a id="2.1"></a>
### A list of dictionaries
This method is ideal for creating DataFrames from data generated within a loop, such as when iterating over data from an API.

In [9]:
# Create a series of dictionaries
my_dict_1 = {"Letters": "a", "Num_1": 23, "Num_2": 2}
my_dict_2 = {"Letters": "b", "Num_1": 26, "Num_2": 3}
my_dict_3 = {"Letters": "c", "Num_1": 32, "Num_2": 2}
my_dict_4 = {"Letters": "d", "Num_1": 21, "Num_2": 4}

# Add all of these dictionaries to a list
my_list = [my_dict_1, my_dict_2, my_dict_3, my_dict_4]

# Then convert that into a DataFrame
df = pd.DataFrame(my_list)
df

Unnamed: 0,Letters,Num_1,Num_2
0,a,23,2
1,b,26,3
2,c,32,2
3,d,21,4
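Since the text mentions loops, here is a rough sketch of that pattern. The `raw_records` list is a hypothetical stand-in for whatever an API or file would return one item at a time.

```python
import pandas as pd

# Hypothetical stand-in for records arriving one at a time, e.g. from an API
raw_records = [("a", 23), ("b", 26), ("c", 32)]

rows = []
for letter, number in raw_records:
    # Each dictionary becomes one row; the keys become column labels
    rows.append({"Letters": letter, "Num_1": number, "Num_1_doubled": number * 2})

# Build the DataFrame once, after the loop has finished
loop_df = pd.DataFrame(rows)
print(loop_df)
```

Collecting plain dictionaries and converting once at the end is usually simpler than growing a DataFrame row by row.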


<a id="2.2"></a>
### A dictionary of lists
A dictionary of lists is a quick way to hand-write small data into a DataFrame.

In [10]:
# Create a series of lists
column_a = ["a","b","c","d"]
column_b = [23,26,32,21]
column_c = [2,3,2,4]

# Insert them into a dictionary
my_dict = {"Letters": column_a, "Num_1": column_b, "Num_2": column_c}

# Then convert that into a DataFrame
df = pd.DataFrame(my_dict)
df

Unnamed: 0,Letters,Num_1,Num_2
0,a,23,2
1,b,26,3
2,c,32,2
3,d,21,4


<a id="2.3"></a>
### Reading from a SQL database
Data can be read directly from SQL databases using Pandas. For this example, we will use sqlalchemy to quickly connect to a SQLite database file.

In [11]:
# Importing additional dependencies
!pip install --user sqlalchemy
from sqlalchemy import create_engine

# Path to SQLite file
database_path = "data_sources/Census_Data.sqlite"

# Creating the engine that manages access to the SQLite database
engine = create_engine(f"sqlite:///{database_path}")

# Establishing a connection to our database
conn = engine.connect()

# Using pandas to read data out of SQL
census_data = pd.read_sql("SELECT * FROM Census_Data", conn)

# Because this DataFrame is so large, we will use the head() method to print out the top 5 entries.
census_data.head()





Unnamed: 0,CityState,city,state,Population,White Population,Black Population,Native American Population,Asian Population,Hispanic Population,Education None,...,Employment Female Computer Engineering,Median Age,Median Male Age,Median Female Age,Household Income,Income Per Capita,Median Gross Rent,Median Home Value,lat,lng
0,"HOUSTON, TX",HOUSTON,TX,3061887,1775897,684416,11586,230549,1368287,54180,...,22637,33.439583,32.55,34.363542,56206.5,32239.52083,956.708333,178233.6842,29.775734,-95.414548
1,"CHICAGO, IL",CHICAGO,IL,2702091,1318869,843633,7554,161478,785374,32800,...,18209,34.526786,33.798214,35.141071,57735.96429,38730.83929,1119.928571,264739.2857,41.867838,-87.67344
2,"BROOKLYN, NY",BROOKLYN,NY,2595259,1126111,870465,8744,297890,509243,48934,...,14845,35.175676,33.367568,36.578378,51469.18919,28309.67568,1261.783784,605743.2432,40.652805,-73.956528
3,"LOS ANGELES, CA",LOS ANGELES,CA,2426413,1068202,324842,15949,273829,1292382,62684,...,12329,35.335484,34.535484,36.06129,47494.58333,30073.19355,1201.766667,557115.0,34.042209,-118.303468
4,"MIAMI, FL",MIAMI,FL,1820704,1361009,363514,2250,33144,1162711,27137,...,6969,38.740741,37.12037,40.262963,51232.90741,25949.35185,1260.833333,243279.6296,25.760268,-80.298511


Do not forget to close the connection and release the engine's resources when we are done with the database!

In [12]:
conn.close()
engine.dispose()
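If the SQLite file is not at hand, the same idea can be tried with an in-memory database. This is only a sketch: it uses Python's built-in `sqlite3` module rather than sqlalchemy, and the `cities` table is made up, with a few population figures borrowed from the output above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, so no file on disk is needed (the table is hypothetical)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("HOUSTON", 3061887), ("CHICAGO", 2702091), ("MIAMI", 1820704)],
)
conn.commit()

try:
    # The SQL query filters and sorts the rows before the data reaches pandas
    big_cities = pd.read_sql(
        "SELECT * FROM cities WHERE population > 2000000 ORDER BY population DESC",
        conn,
    )
finally:
    # Close the connection even if the query raises an error
    conn.close()

print(big_cities)
```

Filtering in the query itself keeps the DataFrame small, which matters for tables much larger than this one.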

<a id="2.4"></a>
### Web Scraping a table
You can scrape table elements directly from HTML using Pandas.

In [13]:
# Defining the URL to scrape from
url = "https://en.wikipedia.org/wiki/List_of_the_highest_major_summits_of_North_America"

# Converting all table elements from the page into DataFrames. This method returns a list of DataFrames from the URL.
mountains_table_list = pd.read_html(url)

# Parsing through the list to find the table we want
mountains_table_list

[     Rank                            Mountain peak                 Region  \
 0       1                Denali[a](Mount McKinley)                 Alaska   
 1       2                           Mount Logan[b]                  Yukon   
 2       3         Pico de Orizaba[c](Citlaltépetl)        Puebla Veracruz   
 3       4                     Mount Saint Elias[d]           Alaska Yukon   
 4       5                       Popocatépetl[e][f]  México Morelos Puebla   
 ..    ...                                      ...                    ...   
 396   396                          Sierra Fría[my]         Aguascalientes   
 397   398                    Hayford Peak[147][mz]                 Nevada   
 398   399  Ulysses Mountain[na][nb](Mount Ulysses)       British Columbia   
 399   400                      Eagle Peak[148][nc]             California   
 400   401                   Sacajawea Peak[nd][ne]                 Oregon   
 
               Mountain range  Elevation Prominence Isolation 

In [14]:
# The table we want is currently the first entry in the list of tables (it used to be the second; the position can change when the page is updated), so we will save it and print its head.
mountains_df = mountains_table_list[0]
mountains_df.head()

Unnamed: 0,Rank,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation,Location
0,1,Denali[a](Mount McKinley),Alaska,Alaska Range,"20,310 ft","20,146 ft",,".mw-parser-output .geo-default,.mw-parser-outp..."
1,2,Mount Logan[b],Yukon,Saint Elias Mountains,"19,551 ft","17,215 ft",387 mi,60°34′02″N 140°24′20″W﻿ / ﻿60.5671°N 140.4055°W
2,3,Pico de Orizaba[c](Citlaltépetl),Puebla Veracruz,Cordillera Neovolcanica,"18,491 ft","16,148 ft",,19°01′50″N 97°16′11″W﻿ / ﻿19.0305°N 97.2698°W
3,4,Mount Saint Elias[d],Alaska Yukon,Saint Elias Mountains,"18,009 ft","11,250 ft",25.6 mi,60°17′34″N 140°55′51″W﻿ / ﻿60.2927°N 140.9307°W
4,5,Popocatépetl[e][f],México Morelos Puebla,Cordillera Neovolcanica,"17,749 ft","9,974 ft",88.8 mi,19°01′21″N 98°37′40″W﻿ / ﻿19.0225°N 98.6278°W


<a id="2.5"></a>
### Reading from a CSV
One of the most common ways to ingress data using Pandas is the humble CSV.

In [15]:
# Defining the CSV path
path = "data_sources/Census_Data.csv"

# Creating a DataFrame from the CSV
census_data = pd.read_csv(path)
census_data.head()

Unnamed: 0,CityState,city,state,Population,White Population,Black Population,Native American Population,Asian Population,Hispanic Population,Education None,...,Employment Female Computer Engineering,Median Age,Median Male Age,Median Female Age,Household Income,Income Per Capita,Median Gross Rent,Median Home Value,lat,lng
0,"HOUSTON, TX",HOUSTON,TX,3061887,1775897,684416,11586,230549,1368287,54180,...,22637,33.439583,32.55,34.363542,56206.5,32239.52083,956.708333,178233.6842,29.775734,-95.414548
1,"CHICAGO, IL",CHICAGO,IL,2702091,1318869,843633,7554,161478,785374,32800,...,18209,34.526786,33.798214,35.141071,57735.96429,38730.83929,1119.928571,264739.2857,41.867838,-87.67344
2,"BROOKLYN, NY",BROOKLYN,NY,2595259,1126111,870465,8744,297890,509243,48934,...,14845,35.175676,33.367568,36.578378,51469.18919,28309.67568,1261.783784,605743.2432,40.652805,-73.956528
3,"LOS ANGELES, CA",LOS ANGELES,CA,2426413,1068202,324842,15949,273829,1292382,62684,...,12329,35.335484,34.535484,36.06129,47494.58333,30073.19355,1201.766667,557115.0,34.042209,-118.303468
4,"MIAMI, FL",MIAMI,FL,1820704,1361009,363514,2250,33144,1162711,27137,...,6969,38.740741,37.12037,40.262963,51232.90741,25949.35185,1260.833333,243279.6296,25.760268,-80.29851


<a id="3"></a>

## 3) Reading data from a DataFrame
Now that we have our data in a DataFrame format, we need to be able to use it. The first thing we will want to learn to that end is how to read data back out!

<a id="3.1"></a>
### Keys/Indexing
We can index a DataFrame much as we would a dictionary, selecting a column by using its label as a key.

In [16]:
df['Letters']

0    a
1    b
2    c
3    d
Name: Letters, dtype: object

We can further drill down using the index.

In [17]:
df['Letters'][3]

'd'

<a id="3.2"></a>
### Using iloc[ ]
Another option is to navigate the DataFrame entirely by numbers using ```iloc[ ]```. We can retrieve a whole column:

In [18]:
df.iloc[:, 0] # Note the format here, [rows, columns]

0    a
1    b
2    c
3    d
Name: Letters, dtype: object

Or a single cell:

In [19]:
df.iloc[3, 0]

'd'

<a id="3.3"></a>
### Using loc[ ]
The above techniques do not always work well: a DataFrame's index can be set to non-numerical data, and code that selects purely by position can quickly become a mass of hard-to-read numbers. In such situations, we can use the ```loc[ ]``` indexer, which selects by label.

In [20]:
df = df.set_index('Letters')
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,23,2
b,26,3
c,32,2
d,21,4


Because there is no longer a numerical index to tell us which item we want, we will use ```loc[ ]``` instead.

In [21]:
df.loc["d"]

Num_1    21
Num_2     4
Name: d, dtype: int64

In [22]:
df.loc["d","Num_1"]

21

A good rule of thumb: ```loc[ ]``` selects data by the labels of the rows or columns (usually words, though labels can be numbers too), while ```iloc[ ]``` selects data by numerical position.
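The distinction matters most when the index labels are themselves numbers. Here is a small sketch with a made-up Series whose labels do not match the positions.

```python
import pandas as pd

# Hypothetical Series whose index labels are integers that do not match the positions
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_label = s.loc[10]      # the row labelled 10
by_position = s.iloc[0]   # the row at position 0 (the same row here)
last = s.iloc[-1]         # positions support negative indexing

# Slices differ too: loc includes the end label, iloc excludes the end position
label_slice = s.loc[10:20]
position_slice = s.iloc[0:1]
print(len(label_slice), len(position_slice))
```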

<a id="3.4"></a>
### Conditional views
What if we want to view only the data that matches certain criteria? For this sort of data parsing, we will want to use a conditional view.

In [23]:
# To start, let's clean up the mountains_df we created earlier.
# You can ignore this cell unless you just want to see an example of lambda functions in action.


# ~~~~~~~~~~~~~~~~ IGNORE THIS CELL ~~~~~~~~~~~~~~~~ #


# Set the index for the mountains DataFrame to the rank column
mountains_df = mountains_df.set_index('Rank')

# Helper functions to convert the Prominence, Elevation, and Isolation to numerical datatypes (applied below through lambda functions)
def convert_ht(x):
    height = x.replace(",","").replace("\xa0ft","")
    return int(height)

def convert_mi(x):
    if isinstance(x, str):
        x = float(x.replace(",","").replace("\xa0mi",""))
    return x

def remove_bad_space(x):
    return x.replace("\xa0"," ")

mountains_df['Region'] = mountains_df['Region'].apply(lambda x:remove_bad_space(x))
mountains_df['Elevation'] = mountains_df['Elevation'].apply(lambda x:convert_ht(x))
mountains_df['Prominence'] = mountains_df['Prominence'].apply(lambda x:convert_ht(x))
mountains_df['Isolation'] = mountains_df['Isolation'].apply(lambda x:convert_mi(x))
mountains_df.head() # Much cleaner looking!

Unnamed: 0_level_0,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation,Location
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Denali[a](Mount McKinley),Alaska,Alaska Range,20310,20146,,".mw-parser-output .geo-default,.mw-parser-outp..."
2,Mount Logan[b],Yukon,Saint Elias Mountains,19551,17215,387.0,60°34′02″N 140°24′20″W﻿ / ﻿60.5671°N 140.4055°W
3,Pico de Orizaba[c](Citlaltépetl),Puebla Veracruz,Cordillera Neovolcanica,18491,16148,,19°01′50″N 97°16′11″W﻿ / ﻿19.0305°N 97.2698°W
4,Mount Saint Elias[d],Alaska Yukon,Saint Elias Mountains,18009,11250,25.6,60°17′34″N 140°55′51″W﻿ / ﻿60.2927°N 140.9307°W
5,Popocatépetl[e][f],México Morelos Puebla,Cordillera Neovolcanica,17749,9974,88.8,19°01′21″N 98°37′40″W﻿ / ﻿19.0225°N 98.6278°W


Let's get started with conditional views with an example using only one condition. We will try to view all the mountains where the region is Alaska. The first step will be to create a boolean Series, a Series object that is "True" for each row where the region is Alaska and "False" for anything else.

In [24]:
# Create a Boolean Series
mountains_df['Region'] == "Alaska"

Rank
1       True
2      False
3      False
4      False
5      False
       ...  
396    False
398    False
399    False
400    False
401    False
Name: Region, Length: 401, dtype: bool

Next, we can use this boolean Series as a key. Note that the entire code for the boolean Series is placed between the brackets of ```mountains_df[ ]```.

In [25]:
# Use Boolean Series as a key to output data
mountains_df[mountains_df['Region'] == "Alaska"].head()

Unnamed: 0_level_0,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation,Location
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Denali[a](Mount McKinley),Alaska,Alaska Range,20310,20146,,".mw-parser-output .geo-default,.mw-parser-outp..."
6,Mount Foraker[g],Alaska,Alaska Range,17400,7250,14.27,62°57′37″N 151°23′59″W﻿ / ﻿62.9604°N 151.3998°W
10,Mount Bona[k],Alaska,Saint Elias Mountains,16550,6900,49.7,61°23′08″N 141°44′58″W﻿ / ﻿61.3856°N 141.7495°W
12,Mount Blackburn[7][m],Alaska,Wrangell Mountains,16390,11640,60.7,61°43′50″N 143°24′11″W﻿ / ﻿61.7305°N 143.4031°W
13,Mount Sanford,Alaska,Wrangell Mountains,16237,7687,40.3,62°12′48″N 144°07′45″W﻿ / ﻿62.2132°N 144.1292°W


We can also parse a DataFrame using multiple conditions. Each condition must be wrapped in parentheses ```( )```, then separated with the appropriate operator. Unlike regular Python, here we use:

```&``` for ```and```

```|``` for ```or```

In [26]:
mountains_df[(mountains_df['Region'] == "Alaska") & (mountains_df['Elevation'] > 15000)]

Unnamed: 0_level_0,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation,Location
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Denali[a](Mount McKinley),Alaska,Alaska Range,20310,20146,,".mw-parser-output .geo-default,.mw-parser-outp..."
6,Mount Foraker[g],Alaska,Alaska Range,17400,7250,14.27,62°57′37″N 151°23′59″W﻿ / ﻿62.9604°N 151.3998°W
10,Mount Bona[k],Alaska,Saint Elias Mountains,16550,6900,49.7,61°23′08″N 141°44′58″W﻿ / ﻿61.3856°N 141.7495°W
12,Mount Blackburn[7][m],Alaska,Wrangell Mountains,16390,11640,60.7,61°43′50″N 143°24′11″W﻿ / ﻿61.7305°N 143.4031°W
13,Mount Sanford,Alaska,Wrangell Mountains,16237,7687,40.3,62°12′48″N 144°07′45″W﻿ / ﻿62.2132°N 144.1292°W
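Here is a compact sketch of the same operators on a made-up miniature table (`peaks` and its values are hypothetical). It also shows ```~``` for ```not``` and the `isin()` method, which are not used above but follow the same pattern.

```python
import pandas as pd

# Hypothetical miniature version of the mountains table
peaks = pd.DataFrame({
    "Region": ["Alaska", "Yukon", "Alaska", "Colorado"],
    "Elevation": [20310, 19551, 14000, 14440],
})

# Naming each boolean Series keeps the final expressions readable
in_alaska = peaks["Region"] == "Alaska"
is_high = peaks["Elevation"] > 15000

both = peaks[in_alaska & is_high]     # and
either = peaks[in_alaska | is_high]   # or
not_alaska = peaks[~in_alaska]        # not
# isin() is a compact alternative to chaining several == comparisons with |
northern = peaks[peaks["Region"].isin(["Alaska", "Yukon"])]

print(len(both), len(either), len(not_alaska), len(northern))
```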


<a id="4"></a>

## 4) Data Manipulation in Pandas
Now that we know how to view our data, we can begin manipulating it.

<a id="4.1"></a>
### Drop
To remove unnecessary data elements, we can use the ```drop()``` method.

In [27]:
# Removing the Location column
mountains_df.drop(columns="Location",inplace=True)
mountains_df.head()

Unnamed: 0_level_0,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,Denali[a](Mount McKinley),Alaska,Alaska Range,20310,20146,
2,Mount Logan[b],Yukon,Saint Elias Mountains,19551,17215,387.0
3,Pico de Orizaba[c](Citlaltépetl),Puebla Veracruz,Cordillera Neovolcanica,18491,16148,
4,Mount Saint Elias[d],Alaska Yukon,Saint Elias Mountains,18009,11250,25.6
5,Popocatépetl[e][f],México Morelos Puebla,Cordillera Neovolcanica,17749,9974,88.8


<a id="4.2"></a>
### DropNA
We can remove data elements from our DataFrame that contain empty cells using the ```dropna()``` method.

In [28]:
# We can use .info() to see what columns have null values
mountains_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 401 entries, 1 to 401
Data columns (total 6 columns):
Mountain peak     401 non-null object
Region            401 non-null object
Mountain range    401 non-null object
Elevation         401 non-null int64
Prominence        401 non-null int64
Isolation         395 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 21.9+ KB


In [29]:
# Using .dropna() to remove rows with empty cells.
mountains_df.dropna(how="any").head()

Unnamed: 0_level_0,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,Mount Logan[b],Yukon,Saint Elias Mountains,19551,17215,387.0
4,Mount Saint Elias[d],Alaska Yukon,Saint Elias Mountains,18009,11250,25.6
5,Popocatépetl[e][f],México Morelos Puebla,Cordillera Neovolcanica,17749,9974,88.8
6,Mount Foraker[g],Alaska,Alaska Range,17400,7250,14.27
7,Mount Lucania[h][i],Yukon,Saint Elias Mountains,17257,10105,26.7


The ```how="any"``` argument means that we will drop a row if **ANY** of the cells in that row are empty. Alternatively, we can specify ```how="all"``` to only drop rows where **ALL** the cells in that row are empty. This second version can be very useful for cleaning up poorly sourced or formatted data, such as CSVs with unnecessary empty rows.
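The difference between the two settings is easiest to see on a tiny example. The `gaps` DataFrame below is made up; the `subset` argument is an additional `dropna()` option not used above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one missing value, row 2 is entirely empty
gaps = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
})

drop_any = gaps.dropna(how="any")        # keeps only row 0
drop_all = gaps.dropna(how="all")        # keeps rows 0 and 1
# subset limits the check to specific columns
drop_subset = gaps.dropna(subset=["a"])  # keeps rows 0 and 1

print(len(drop_any), len(drop_all), len(drop_subset))
```

Note that `dropna()` returns a new DataFrame here; `gaps` itself is left unchanged.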

<a id="4.2.5"></a>
### Duplicated
The ```duplicated( )``` method returns a boolean Series that marks which rows of our DataFrame are duplicates. Used as a key, that Series lets us display the duplicate data.

In [30]:
# Creating a DataFrame with duplicate data.
duplicate_df = pd.DataFrame({
    "a": [14,14,23,45,67,32],
    "b": [22,22,23,39,55,22],
    "c": ["w","w","w","x","y","z"]
})
duplicate_df

Unnamed: 0,a,b,c
0,14,22,w
1,14,22,w
2,23,23,w
3,45,39,x
4,67,55,y
5,32,22,z


In [31]:
duplicate_df[duplicate_df.duplicated(keep=False)]

Unnamed: 0,a,b,c
0,14,22,w
1,14,22,w


<a id="4.3"></a>
### Drop Duplicates
The ```drop_duplicates( )``` method is an efficient way to remove any duplicate data from your DataFrame. The "keep" parameter accepts three possible values: "first", "last", and False.

In [32]:
# Using the drop_duplicates method. The subset argument checks only column "b" for duplicates. The default value for the "keep" parameter is "first".
duplicate_df.drop_duplicates(subset="b")

Unnamed: 0,a,b,c
0,14,22,w
2,23,23,w
3,45,39,x
4,67,55,y


In [33]:
# Specifying the keep="first" parameter. This keeps the first instance of duplicated data.
duplicate_df.drop_duplicates(keep="first")

Unnamed: 0,a,b,c
0,14,22,w
2,23,23,w
3,45,39,x
4,67,55,y
5,32,22,z


In [34]:
# Specifying the keep="last" parameter. This keeps the last instance of duplicated data.
duplicate_df.drop_duplicates(keep="last")

Unnamed: 0,a,b,c
1,14,22,w
2,23,23,w
3,45,39,x
4,67,55,y
5,32,22,z


In [35]:
# Specifying the keep=False parameter. This will drop all duplicated data, including first and last instances.
duplicate_df.drop_duplicates(keep=False)

Unnamed: 0,a,b,c
2,23,23,w
3,45,39,x
4,67,55,y
5,32,22,z


<a id="4.4"></a>
### At and Iat
The ```at[ ]``` and ```iat[ ]``` accessors are similar to ```loc[ ]``` and ```iloc[ ]```, but each one targets a single cell. This makes them a quick way to read one value or to change it directly.

In [36]:
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,23,2
b,26,3
c,32,2
d,21,4


In [37]:
df.at["a","Num_2"] = 5
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,23,5
b,26,3
c,32,2
d,21,4


In [38]:
df.iat[0,1] = 2
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,23,2
b,26,3
c,32,2
d,21,4
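As a brief sketch of reading as well as writing, the example below uses a made-up two-row DataFrame (`cells` is a hypothetical name). It also shows that ```loc[ ]``` can assign values; ```at[ ]``` and ```iat[ ]``` are simply specialised for single cells.

```python
import pandas as pd

cells = pd.DataFrame({"Num_1": [23, 26], "Num_2": [2, 3]}, index=["a", "b"])

value = cells.at["b", "Num_1"]   # read one cell by label
cells.iat[0, 1] = 99             # write one cell by position (row 0, column 1)
cells.loc["b", "Num_2"] = 7      # loc can assign too

print(value)
print(cells)
```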


<a id="4.5"></a>
### Append
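Appending combines two DataFrames by stacking the rows of one beneath the rows of the other. Note that the ```DataFrame.append()``` method was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, ```pd.concat()``` does the same job.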

In [39]:
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,23,2
b,26,3
c,32,2
d,21,4


In [40]:
my_dict = {"Letters": ["e","f","g"], "Num_1": [20,23,24], "Num_2": [2,1,3]}
df2 = pd.DataFrame(my_dict).set_index("Letters")
df2

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
e,20,2
f,23,1
g,24,3


In [41]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() stacks the rows instead
df3 = pd.concat([df, df2])
df3

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,23,2
b,26,3
c,32,2
d,21,4
e,20,2
f,23,1
g,24,3
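When stacking DataFrames with ```pd.concat()```, the original index labels are kept by default. The sketch below, on two made-up DataFrames, shows the `ignore_index` option for cases where a fresh 0..n-1 index is preferable.

```python
import pandas as pd

top = pd.DataFrame({"Num_1": [23, 26]}, index=["a", "b"])
bottom = pd.DataFrame({"Num_1": [20, 23]}, index=["e", "f"])

# Default: the original index labels are carried over
stacked = pd.concat([top, bottom])
# ignore_index=True discards them and numbers the rows from 0
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(stacked.index), list(renumbered.index))
```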


<a id="4.6"></a>
### Join
Join also combines DataFrames, but places them side by side, adding the columns of one to the other and matching rows by their index labels.

In [42]:
my_dict = {"Letters": ["a","b","c"], "Num_1": [14,13,14], "Num_2": [7,10,13]}
df = pd.DataFrame(my_dict).set_index("Letters")
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,14,7
b,13,10
c,14,13


In [43]:
my_dict = {"Letters": ["a","b","c"], "Num_3": [20,23,24], "Num_4": [2,1,3]}
df2 = pd.DataFrame(my_dict).set_index("Letters")
df2

Unnamed: 0_level_0,Num_3,Num_4
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,20,2
b,23,1
c,24,3


In [44]:
df3 = df.join(df2)
df3

Unnamed: 0_level_0,Num_1,Num_2,Num_3,Num_4
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,14,7,20,2
b,13,10,23,1
c,14,13,24,3
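In the example above both DataFrames share the same index labels. The sketch below, with made-up data, shows what happens when they do not, using the `how` parameter of `join()`.

```python
import pandas as pd

left = pd.DataFrame({"Num_1": [14, 13, 14]}, index=["a", "b", "c"])
right = pd.DataFrame({"Num_3": [20, 23]}, index=["a", "b"])

# Default how="left": every row of `left` is kept; "c" has no match, so Num_3 is NaN
left_join = left.join(right)
# how="inner": only labels present in both DataFrames are kept
inner_join = left.join(right, how="inner")

print(left_join)
print(inner_join)
```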


<a id="4.7"></a>
### GroupBy


In [45]:
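# Select the numeric columns first; text columns cannot be averaged (pandas 2.0+ raises an error)
mgbdf = mountains_df.groupby("Region")[["Elevation","Prominence","Isolation"]].agg(["mean","min","max","std"])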
mgbdf.head(10)

Unnamed: 0_level_0,Elevation,Elevation,Elevation,Elevation,Prominence,Prominence,Prominence,Prominence,Isolation,Isolation,Isolation,Isolation
Unnamed: 0_level_1,mean,min,max,std,mean,min,max,std,mean,min,max,std
Region,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
Aguascalientes,9941.0,9941,9941,,1640.0,1640,1640,,145.6,145.6,145.6,
Alaska,12823.886792,10016,20310,1898.68285,4923.792453,1650,20146,3419.844792,21.714615,2.26,126.3,29.329215
Alaska British Columbia,12742.666667,10016,15325,2657.441313,6837.0,2979,12995,5389.579204,51.953333,5.46,124.4,63.575644
Alaska Yukon,15058.0,13760,18009,1709.690177,6810.6,1950,11250,3489.060877,15.62,2.25,25.6,8.849438
Alberta,11471.555556,10879,12247,475.059236,4864.111111,2438,6670,1423.9996,15.918889,4.26,29.5,9.216612
Alberta British Columbia,11800.333333,11263,12274,508.498115,6516.0,4938,7779,1446.457397,72.3,30.6,98.2,36.46464
Arizona,11590.0,10724,12637,969.258995,5702.333333,4728,6340,857.113956,160.8,82.4,246.0,82.011706
Baja California,10154.0,10154,10154,,6972.0,6972,6972,,208.0,208.0,208.0,
British Columbia,11052.428571,9921,13186,874.006078,6374.821429,1716,10791,2145.744213,60.371786,1.78,349.0,92.648465
California,12596.794118,9895,14505,1417.333749,3806.529412,1676,10080,2422.42327,34.232121,3.09,335.0,63.079248


In [46]:
mgbdf["Elevation"].head()

Unnamed: 0_level_0,mean,min,max,std
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aguascalientes,9941.0,9941,9941,
Alaska,12823.886792,10016,20310,1898.68285
Alaska British Columbia,12742.666667,10016,15325,2657.441313
Alaska Yukon,15058.0,13760,18009,1709.690177
Alberta,11471.555556,10879,12247,475.059236


In [47]:
mgbdf["Elevation"].loc['Alaska British Columbia', "std"]

2657.441313243499
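The cells above split the mountains by region with ```groupby()```, compute several statistics per group with ```agg()```, and then drill into the result by column and row label. The same split-then-summarise pattern is sketched below on a tiny made-up table, where the numbers are easy to check by hand.

```python
import pandas as pd

# Hypothetical miniature version of the mountains table
peaks = pd.DataFrame({
    "Region": ["Alaska", "Alaska", "Yukon"],
    "Elevation": [20310, 17400, 19551],
})

# Split the rows by Region, then summarise the Elevation column within each group
summary = peaks.groupby("Region")["Elevation"].agg(["mean", "max"])
print(summary)

# The result is an ordinary DataFrame indexed by Region
alaska_mean = summary.loc["Alaska", "mean"]
```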

<a id="4.8"></a>
### Binning Data
Binning data is another important technique that is often used in conjunction with groupby. Binning involves cutting numerical data into pre-defined ranges for discrete analysis. Other terms for binning are discretization, bucketing, or quantization. Binning is very useful for processing continuous data, meaning data whose values can all be unique and do not fit naturally into categories.

In [48]:
# Cutting the DataFrame into bins
bins = [0,14000,16000,18000,20000,22000]
labels = ["0-14000","14000-16000","16000-18000","18000-20000","20000-22000"] # There should always be one fewer label than bin edges

mountains_df['Elevation Range'] = pd.cut(mountains_df['Elevation'], bins, labels=labels)
mountains_df

Unnamed: 0_level_0,Mountain peak,Region,Mountain range,Elevation,Prominence,Isolation,Elevation Range
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Denali[a](Mount McKinley),Alaska,Alaska Range,20310,20146,,20000-22000
2,Mount Logan[b],Yukon,Saint Elias Mountains,19551,17215,387.0,18000-20000
3,Pico de Orizaba[c](Citlaltépetl),Puebla Veracruz,Cordillera Neovolcanica,18491,16148,,18000-20000
4,Mount Saint Elias[d],Alaska Yukon,Saint Elias Mountains,18009,11250,25.6,18000-20000
5,Popocatépetl[e][f],México Morelos Puebla,Cordillera Neovolcanica,17749,9974,88.8,16000-18000
...,...,...,...,...,...,...,...
396,Sierra Fría[my],Aguascalientes,Sierra Madre Occidental,9941,1640,145.6,0-14000
398,Hayford Peak[147][mz],Nevada,Sheep Range,9924,5412,33.8,0-14000
399,Ulysses Mountain[na][nb](Mount Ulysses),British Columbia,Muskwa Ranges,9921,7526,271.0,0-14000
400,Eagle Peak[148][nc],California,Warner Mountains,9895,4362,87.4,0-14000


In [49]:
# Using groupby to get the average statistics in each range (numeric_only skips the text columns)
mountains_df.groupby('Elevation Range').mean(numeric_only=True)

Unnamed: 0_level_0,Elevation,Prominence,Isolation
Elevation Range,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0-14000,11952.266862,4033.539589,44.144265
14000-16000,14446.0,4631.425532,38.856818
16000-18000,16909.333333,7187.444444,33.347778
18000-20000,18683.666667,14871.0,206.3
20000-22000,20310.0,20146.0,
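One detail worth knowing about ```pd.cut()``` is how values that sit exactly on a bin edge are handled. The sketch below uses made-up elevations and omits the labels so that the interval boundaries stay visible; the `right` parameter is a `pd.cut()` option not used above.

```python
import pandas as pd

elevations = pd.Series([14000, 14001, 16000])
bins = [0, 14000, 16000]

# Default: intervals are closed on the right, i.e. (0, 14000] and (14000, 16000]
default_cut = pd.cut(elevations, bins)
# right=False closes them on the left instead: [0, 14000) and [14000, 16000),
# so a value of exactly 16000 falls outside every bin and becomes NaN
left_closed = pd.cut(elevations, bins, right=False)

print(default_cut.tolist())
print(left_closed.tolist())
```

With the default setting, a peak of exactly 14000 ft lands in the lower bin, which is why the outermost bin edges should comfortably enclose the data.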


<a id="5"></a>

## 5) Views vs. Copies

In [50]:
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,14,7
b,13,10
c,14,13


In [51]:
view = df['Num_1']
view.iat[0] = 365
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,365,7
b,13,10
c,14,13


In [52]:
view_2 = df['Num_1'].copy()
view_2.iat[0] = 0
df

Unnamed: 0_level_0,Num_1,Num_2
Letters,Unnamed: 1_level_1,Unnamed: 2_level_1
a,365,7
b,13,10
c,14,13


Future inclusions:
* set_index
* reset_index
* add a new column
* rename a column

### Thank you for reading!
-Seth Pruitt