# Pandas
Pandas is a data analysis library for Python that makes it easy to load, manipulate, and store data.

In [1]:
# Importing Pandas as a dependency.
import pandas as pd

## Pandas Data Structures
Pandas features two major data structures: Series and DataFrames.

### Series
Series objects are indexed, one-dimensional arrays that behave similarly to native Python lists and also provide several pandas-specific methods.

In [2]:
# A native python list
my_data = ["a","b","c","d","e","f","h","i","j"]

# Conversion to Pandas Series object
my_series = pd.Series(my_data)

# OPTIONAL: Naming our series
my_series.name = "My Letters"

# Printing out our Series
my_series

0    a
1    b
2    c
3    d
4    e
5    f
6    h
7    i
8    j
Name: My Letters, dtype: object

On the left, the Series index is visible, starting from 0. To the right is the data we created in our list. At the bottom, we can see the optional name we added to the Series, as well as the datatype of the data within it.

While this may not seem terribly impressive compared to a normal Python list, it allows us to use special pandas methods with our data. Below, the ```describe()``` method is used to return summary statistics for a set of data.

In [4]:
my_data_2 = [23,52,62,25,24,22,21,28,32]
my_series_2 = pd.Series(my_data_2)
# PRINT OUT MY_SERIES_2
my_series_2

0    23
1    52
2    62
3    25
4    24
5    22
6    21
7    28
8    32
dtype: int64

In [5]:
# USE THE DESCRIBE() METHOD WITH MY_SERIES_2
my_series_2.describe()

count     9.000000
mean     32.111111
std      14.709219
min      21.000000
25%      23.000000
50%      25.000000
75%      32.000000
max      62.000000
dtype: float64

Note that the datatype of the ```describe()``` result is different. That is because the result is a Series of its own!
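Because the result of ```describe()``` is itself a Series indexed by statistic name, individual statistics can be looked up by label. A minimal sketch, reusing the numbers from above (the names `numbers` and `summary` are only for illustration):

```python
import pandas as pd

numbers = pd.Series([23, 52, 62, 25, 24, 22, 21, 28, 32])
summary = numbers.describe()

# The summary is a Series whose index holds the statistic names
print(type(summary).__name__)

# Individual statistics can be selected by their label
print(summary["max"])
print(summary["50%"])
```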

### DataFrames
The second major Pandas data structure is the DataFrame. A DataFrame is a tabular (table-like), 2-dimensional (i.e., rows and columns) object that is in many ways the central part of the pandas library. DataFrames are both indexed, like Series, and labeled. The index corresponds to the rows of the DataFrame, while the labels correspond to the columns.

In [10]:
# Initializing a DataFrame using the first Series we created.
df = pd.DataFrame(my_series)

# Adding the second Series to the DataFrame
df["My Numbers"] = my_series_2

# Viewing the DataFrame
df

   My Letters  My Numbers
0           a          23
1           b          52
2           c          62
3           d          25
4           e          24
5           f          22
6           h          21
7           i          28
8           j          32


The DataFrame features the same index as the Series used to create its columns. Within the DataFrame, every row and every column is in fact its own Pandas Series.
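To see this for ourselves, we can pull a column and a row out of a small DataFrame and check their types. This is a minimal sketch on a shortened, illustrative copy of the data (`frame`, `column`, and `row` are names chosen for this example):

```python
import pandas as pd

letters = pd.Series(["a", "b", "c"], name="My Letters")
frame = pd.DataFrame(letters)
frame["My Numbers"] = [23, 52, 62]

column = frame["My Numbers"]  # select a column by its label
row = frame.loc[0]            # select a row by its index label

# Both selections come back as Series objects
print(type(column).__name__, type(row).__name__)
```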

## Creating DataFrames
Pandas has several convenient built-in ways to create DataFrames, depending on the structure of the source data. The following are just a few of the most common:

### A list of dictionaries
This method is ideal for creating DataFrames from data generated within a loop, such as iterating over data from an API.

In [None]:
# Create a series of dictionaries
my_dict_1 = {"Letters": "a", "Num_1": 23, "Num_2": 2}
my_dict_2 = {"Letters": "b", "Num_1": 26, "Num_2": 3}
my_dict_3 = {"Letters": "c", "Num_1": 32, "Num_2": 2}
my_dict_4 = {"Letters": "d", "Num_1": 21, "Num_2": 4}

# Add all of these dictionaries to a list
my_list = [my_dict_1, my_dict_2, my_dict_3, my_dict_4]

# Then convert that into a DataFrame
# CONVERT THE LIST TO A DATAFRAME HERE
df
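As a rough sketch of the loop pattern described above: each pass through the loop builds one dictionary, and the finished list is handed to `pd.DataFrame`. The two input lists are a hypothetical stand-in for records returned by an API, and `records` and `example_df` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for values returned by an API
letters = ["a", "b", "c", "d"]
values = [23, 26, 32, 21]

records = []
for letter, value in zip(letters, values):
    # One dictionary per row
    records.append({"Letters": letter, "Num_1": value})

# Each dictionary becomes a row; the keys become the column labels
example_df = pd.DataFrame(records)
print(example_df)
```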

### A dictionary of lists
A dictionary of lists is a quick way to hand-write small data into a DataFrame.

In [None]:
# Create a series of lists
column_a = ["a","b","c","d"]
column_b = [23,26,32,21]
column_c = [2,3,2,4]

# Insert them into a dictionary
my_dict = {"Letters": column_a, "Num_1": column_b, "Num_2": column_c}

# Then convert that into a DataFrame
# CONVERT THE DICTIONARY TO A DATAFRAME HERE
df
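In this layout the dictionary keys become the column labels and each list fills one column, so all of the lists need to be the same length. A small sketch with illustrative names (`small_dict`, `small_df`):

```python
import pandas as pd

small_dict = {"Letters": ["a", "b"], "Num_1": [23, 26]}

# Keys become column labels; each list becomes one column
small_df = pd.DataFrame(small_dict)
print(small_df)
```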

### Reading from a SQL database
Data can be read directly from SQL databases using Pandas. For this example, we will use SQLAlchemy to create an engine that connects to a SQLite file.

In [None]:
# Importing additional dependencies
from sqlalchemy import create_engine

# Path to SQLite file
database_path = "data_sources/Census_Data.sqlite"

# Creating a SQLAlchemy engine for the SQLite database
engine = create_engine(f"sqlite:///{database_path}")

# Establishing a connection to our database
conn = engine.connect()

# Using pandas to read data out of SQL
census_data = # USE PANDAS TO CONNECT TO THE SQL DATABASE

# Because this DataFrame is so large, we will use the head() method to print out the top 5 entries.
# USE THE HEAD() METHOD TO VIEW DATAFRAME
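For a self-contained illustration of the same idea, the sketch below swaps in an in-memory SQLite database from the standard library's `sqlite3` module instead of the SQLAlchemy engine and data file used above. The table name `census` and its columns are hypothetical and are not taken from the real file.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical table, so the example needs no file
example_conn = sqlite3.connect(":memory:")
example_conn.execute("CREATE TABLE census (city TEXT, population INTEGER)")
example_conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [("Springfield", 30000), ("Shelbyville", 25000)],
)
example_conn.commit()

# read_sql_query runs the query and returns the result as a DataFrame
example_census = pd.read_sql_query("SELECT * FROM census", example_conn)
print(example_census.head())

example_conn.close()
```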

Don't forget to shut down the database connection when we are done with it!

In [None]:
# Close the open connection, then release the engine's resources
conn.close()
engine.dispose()

### Web Scraping a table
You can scrape table elements directly from HTML using Pandas.

In [None]:
# Defining the URL to scrape from
url = "https://en.wikipedia.org/wiki/List_of_the_highest_major_summits_of_North_America"

# Converting all table elements from the page into DataFrames. This method returns a list of DataFrames from the URL.
mountains_table_list = # USE PANDAS TO SCRAPE THE URL FOR TABLES

# Parsing through the list to find the table we want
mountains_table_list

In [None]:
# It looks like the table we want is the second entry in the list of tables, so we will save it and print its head.
mountains_df = # INPUT THE CORRECT TABLE HERE
mountains_df.head()

### Reading from a CSV
The humble CSV is one of the most common ways to load data into Pandas.

In [None]:
# Defining the CSV path
path = "data_sources/Census_Data.csv"

# Creating a DataFrame from the CSV
census_data = # USE PANDAS TO READ THE CSV
census_data.head()
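`pd.read_csv` accepts a file path or any file-like object, so a quick way to try it without a file on disk is to wrap some text in `io.StringIO`. The CSV content below is hypothetical and only illustrates the call:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "city,population\nSpringfield,30000\nShelbyville,25000\n"

# The first line is used as the column labels by default
example_csv_df = pd.read_csv(io.StringIO(csv_text))
print(example_csv_df.head())
```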

## Reading from a DataFrame
Now that we have our data in a DataFrame format, we need to be able to use it. The first thing we will want to learn to that end is how to read data back out!

### Indexing
We can parse a DataFrame much as we would a list or dictionary, selecting a column by using its label as a key.

In [None]:
df['Letters']

We can further drill down using the index.

In [None]:
# SELECT A SINGLE CELL USING INDEXING
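One way to sketch this drill-down on a small illustrative DataFrame (`example_df` is a name made up for this example) is to select the column first and then the row's index label:

```python
import pandas as pd

example_df = pd.DataFrame({"Letters": ["a", "b", "c"], "Num_1": [23, 26, 32]})

# First key: the column label. Second key: the row's index label.
cell = example_df["Num_1"][1]
print(cell)
```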

### Using iloc[ ]
Another option is to navigate the DataFrame entirely by numbers using ```iloc[ ]```. We can retrieve a whole column:

In [None]:
df.iloc[:, 0] # Note the format here, [rows, columns]

Or a single cell:

In [None]:
# SELECT A SINGLE CELL USING ILOC[]
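A brief sketch of both uses on an illustrative DataFrame: positions are zero-based and given as `[row, column]`, regardless of what the index or column labels are.

```python
import pandas as pd

example_df = pd.DataFrame({"Letters": ["a", "b", "c"], "Num_1": [23, 26, 32]})

first_column = example_df.iloc[:, 0]  # every row, first column
single_cell = example_df.iloc[1, 1]   # second row, second column

print(list(first_column))
print(single_cell)
```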

### Using loc[ ]
Positional lookups do not always work well: depending on how a DataFrame's index is set, row positions can be hard to interpret, and the code ends up as a mass of incomprehensible numbers. To get around this, we can use the ```loc[ ]``` indexer, which selects by label.

In [None]:
df = df.set_index('Letters')
df

Because the DataFrame no longer has a numerical index to tell us which row we want, we will use ```loc[ ]``` with the row's label instead.

In [None]:
df.loc["d"]

In [None]:
# SELECT A SINGLE CELL USING LOC[]
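A short sketch of label-based selection on an illustrative DataFrame indexed by letter (`by_letter` is a name chosen for this example). Passing a row label returns the whole row; passing a row label and a column label returns one cell.

```python
import pandas as pd

by_letter = pd.DataFrame(
    {"Letters": ["a", "b", "c"], "Num_1": [23, 26, 32], "Num_2": [2, 3, 2]}
).set_index("Letters")

whole_row = by_letter.loc["b"]             # a Series holding row "b"
single_cell = by_letter.loc["b", "Num_1"]  # one value: row "b", column "Num_1"

print(whole_row)
print(single_cell)
```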

### Conditional views

In [None]:
# Set the index for the mountains DataFrame to the rank column
mountains_df = mountains_df.set_index('Rank')

# Use helper functions to convert the Prominence, Elevation, and Isolation columns to numerical datatypes
def convert_ht(x):
    # Strip the thousands separator and the "ft" suffix, e.g. "20,310\xa0ft" -> 20310
    height = x.replace(",","").replace("\xa0ft","")
    return int(height)

def convert_mi(x):
    # Only strings need cleaning; values that are already numeric pass through unchanged
    if isinstance(x, str):
        x = float(x.replace(",","").replace("\xa0mi",""))
    return x

# apply() calls the helper once for every value in the column
mountains_df['Elevation'] = mountains_df['Elevation'].apply(convert_ht)
mountains_df['Prominence'] = mountains_df['Prominence'].apply(convert_ht)
mountains_df['Isolation'] = mountains_df['Isolation'].apply(convert_mi)
mountains_df.head()

Read with one condition

In [None]:
# Create a Boolean Series
mountains_df['Region'] == "Alaska"

In [None]:
# Use Boolean Series as a key to output data
mountains_df[mountains_df['Region'] == "Alaska"].head()

We can also parse a DataFrame using multiple conditions.

In [None]:
# PARSE THE DATAFRAME FOR ALASKAN MOUNTAINS TALLER THAN 15,000 FT.
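When combining conditions, each comparison goes in its own parentheses and the Boolean Series are joined with `&` (and) or `|` (or) rather than Python's `and`/`or`. The sketch below uses a hypothetical table of peaks with made-up values, not the scraped data:

```python
import pandas as pd

# Hypothetical peaks; the values are illustrative only
peaks = pd.DataFrame({
    "Peak": ["Peak A", "Peak B", "Peak C", "Peak D"],
    "Region": ["Alaska", "Alaska", "Yukon", "Alaska"],
    "Elevation": [20000, 14000, 19000, 16000],
})

# Each condition yields a Boolean Series; & keeps rows where both are True
mask = (peaks["Region"] == "Alaska") & (peaks["Elevation"] > 15000)
tall_alaskan = peaks[mask]
print(tall_alaskan)
```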


## Data Manipulation in Pandas
Now that we know how to view our data, we can begin manipulating it.

### Drop
To remove unnecessary data elements, we can use the ```drop()``` method.

In [None]:
# Removing the Location column
mountains_df.drop(columns="Location",inplace=True)
mountains_df.head()

### DropNA
We can remove data elements from our DataFrame that contain empty cells using the ```dropna()``` method.

In [None]:
# We can use .info() to see what columns have null values
mountains_df.info()

In [None]:
# Using .dropna() to remove rows with empty cells
mountains_df.dropna(how="any").head()
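Note that ```dropna()``` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on hypothetical data with one missing value:

```python
import pandas as pd

# Hypothetical data; None becomes a missing value (NaN) in the numeric column
example_df = pd.DataFrame({
    "Peak": ["Peak A", "Peak B", "Peak C"],
    "Isolation": [12.5, None, 3.0],
})

# how="any" drops a row if at least one of its cells is missing
cleaned = example_df.dropna(how="any")

print(len(example_df), len(cleaned))
```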

### At and Iat
The ```at[ ]``` and ```iat[ ]``` accessors are similar to ```loc[ ]``` and ```iloc[ ]```, but they always address a single cell (by label and by position, respectively). That makes them a convenient way to overwrite one value directly.

In [None]:
df

In [None]:
df.at["a","Num_2"] = 5
df

In [None]:
df.iat[0,1] = 2
df
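The same two assignments, sketched on a small stand-alone DataFrame so the effect is easy to check (`example_df` is an illustrative name):

```python
import pandas as pd

example_df = pd.DataFrame(
    {"Letters": ["a", "b"], "Num_1": [23, 26], "Num_2": [2, 3]}
).set_index("Letters")

# at[] addresses one cell by row label and column label
example_df.at["a", "Num_2"] = 5

# iat[] addresses one cell by row position and column position
example_df.iat[1, 0] = 99

print(example_df)
```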

### Append
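Appending combines two DataFrames by stacking the rows of one beneath the other. Note that the ```DataFrame.append()``` method was deprecated in pandas 1.4 and removed in pandas 2.0; ```pd.concat()``` is its replacement.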

In [None]:
df

In [None]:
my_dict = {"Letters": ["e","f","g"], "Num_1": [20,23,24], "Num_2": [2,1,3]}
df2 = pd.DataFrame(my_dict).set_index("Letters")
df2

In [None]:
df3 = # APPEND DF2 TO DF
df3
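A minimal sketch of stacking two DataFrames with `pd.concat`, which works on current pandas versions. The frames `top` and `bottom` are illustrative stand-ins for the ones above.

```python
import pandas as pd

top = pd.DataFrame({"Letters": ["a", "b"], "Num_1": [23, 26]}).set_index("Letters")
bottom = pd.DataFrame({"Letters": ["e", "f"], "Num_1": [20, 23]}).set_index("Letters")

# concat takes a list of DataFrames and stacks them row-wise by default
stacked = pd.concat([top, bottom])
print(stacked)
```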

### Join
Join also combines DataFrames, but places them side by side: rows are matched on the index, and the columns of the second DataFrame are added to the first.

In [None]:
my_dict = {"Letters": ["a","b","c"], "Num_1": [14,13,14], "Num_2": [7,10,13]}
df = pd.DataFrame(my_dict).set_index("Letters")
df

In [None]:
my_dict = {"Letters": ["a","b","c"], "Num_3": [20,23,24], "Num_4": [2,1,3]}
df2 = pd.DataFrame(my_dict).set_index("Letters")
df2

In [None]:
df3 = # JOIN THE TWO DATA FRAMES TOGETHER
df3
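A short sketch of an index-aligned join on two illustrative DataFrames that share the same index labels (`left`, `right`, and `joined` are names chosen for this example):

```python
import pandas as pd

left = pd.DataFrame({"Letters": ["a", "b", "c"], "Num_1": [14, 13, 14]}).set_index("Letters")
right = pd.DataFrame({"Letters": ["a", "b", "c"], "Num_3": [20, 23, 24]}).set_index("Letters")

# join() matches rows on the index and adds the other frame's columns
joined = left.join(right)
print(joined)
```

By default, `join()` keeps every row of the left DataFrame; labels missing from the right DataFrame would produce empty (NaN) cells in the added columns.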

### GroupBy

In [None]:
# Average the numeric columns per region; numeric_only=True skips text columns
mountains_df.groupby("Region").mean(numeric_only=True)