# Python


## Intro to Python

*Computer programming* is the process of giving instructions to a computer to perform an action or set of actions. Computer programming is done using a *programming language*--the words and symbols we use to write instructions for computers to follow.

Data professionals use Python to analyze data in faster, more efficient, and more powerful ways because it optimizes every phase of the data workflow--exploring, cleaning, visualizing data, and creating machine learning models.

Python, R, Java, and C++ are four of the most commonly used programming languages for data analysis. The following chart compares them using five considerations: speed, accessibility, variables, data science focus, and programming paradigm.
- Speed: Compile time, runtime, hardware, installed dependencies, and code efficiency all contribute to the speed of a program's execution.
- Accessibility: Refers to how easy the programming language is to learn and use.
- Variables: The way a language handles variables affects how quickly a program runs. In languages that use *static variables* (i.e., statically typed languages), a variable's type is fixed before the program runs. In languages that use *dynamic variables* (i.e., dynamically typed languages), a variable's type is determined while the program runs.
- Data science focus: Some programming languages have individual characteristics that better serve tasks in data analysis.
- Programming paradigm: *Object-oriented* programming languages are modeled around data objects. *Functional* programming languages are modeled around functions. *Imperative* languages are modeled around code statements that can alter the state of the program itself.


| Features by software | Python | R | Java | C++ |
| --- | --- | --- | --- | --- |
| Speed | Slower | Depends on configuration and add-ons | Faster | Very Fast |
| Accessibility | Easy to learn | Complex | Easy to learn | Complex |
| Variable | Dynamic | Dynamic | Static | Declarative |
| Data science focus | Machine learning and automated analysis | Exploratory data analysis and building extensive statistical libraries | Used across projects with open-source assets | Not as widely used but very powerful implementations | 
| Programming paradigm | Object-oriented | Functional | Object-oriented | Multi-paradigm (imperative & object-oriented) |
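
The three paradigms named above can be contrasted inside Python itself, since Python supports all of them to some degree. The sketch below sums the same list in each style; the `Tally` class and the variable names are hypothetical and exist only for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Imperative: statements update program state step by step
total_imperative = 0
for n in numbers:
    total_imperative += n

# Functional: the result is built by applying a function, with no reassignment
total_functional = reduce(lambda acc, n: acc + n, numbers, 0)

# Object-oriented: the data and the code that works on it live in one object
class Tally:  # hypothetical class, for illustration only
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)

total_object_oriented = Tally(numbers).total()

print(total_imperative, total_functional, total_object_oriented)
```

All three approaches produce the same total; they differ only in how the computation is organized.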

<a name="jupyter-notebooks"></a>
## Jupyter Notebooks

*Jupyter notebooks* are open-source web applications for creating and sharing documents containing live code, mathematical formulas, visualizations, and text.

Jupyter notebooks are partitioned into *cells*--modular code input or output fields.

Learn more about Jupyter notebooks and the Jupyter project online: [docs.jupyter.org](https://docs.jupyter.org/en/latest/).

## Object-Oriented Programming

Object-oriented programming is a programming system that is based around objects, which can contain both data and code that manipulates that data.

An *object* is an instance of a class and a fundamental building block of Python. A *class* is an object's data type that bundles data and functionality together.

For example, a value that belongs to the *string* class gives us access to the functionality of a string, including `swapcase`, `replace`, and `split`.

In [1]:
# Assign a string to a variable and check its type
magic = 'HOCUS POCUS'
print(type(magic))

<class 'str'>


In [2]:
# Use swapcase() string method to convert from caps to lowercase
magic = 'HOCUS POCUS'
magic = magic.swapcase()
magic

'hocus pocus'

In [3]:
# Use replace() string method to replace some letters with other letters
magic = magic.replace('cus', 'key')
magic

'hokey pokey'

In [4]:
# Use split() string method to split the string into 2 strings
magic = magic.split()
magic

['hokey', 'pokey']

`swapcase`, `replace`, and `split` are examples of *methods*. A method is a function that belongs to a class and typically performs an action or operation.

Methods and attributes in a class are accessed using *dot notation*.

The core Python classes include:
- Integers
- Floats
- Strings
- Booleans
- Lists
- Dictionaries
- Tuples
- Sets
- Frozensets
- Functions
- Ranges
- None

An *attribute* is a value associated with an object or class that is referenced by name using dot notation.

For example, a Pandas DataFrame has attributes called `shape` and `columns`.

In [5]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Set-up cell to create the `planets` dataframe
# (This cell was not shown in the instructional video.)
import pandas as pd
data = [['Mercury', 2440, 0], ['Venus', 6052, 0,], ['Earth', 6371, 1],
        ['Mars', 3390, 2], ['Jupiter', 69911, 80], ['Saturn', 58232, 83],
        ['Uranus', 25362, 27], ['Neptune', 24622, 14]
]

cols = ['Planet', 'radius_km', 'moons']

planets = pd.DataFrame(data, columns=cols)

In [7]:
# Display the `planets` dataframe
planets

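|  | Planet | radius_km | moons |
| --- | --- | --- | --- |
| 0 | Mercury | 2440 | 0 |
| 1 | Venus | 6052 | 0 |
| 2 | Earth | 6371 | 1 |
| 3 | Mars | 3390 | 2 |
| 4 | Jupiter | 69911 | 80 |
| 5 | Saturn | 58232 | 83 |
| 6 | Uranus | 25362 | 27 |
| 7 | Neptune | 24622 | 14 |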


In [8]:
# Use shape dataframe attribute to check number of rows and columns
planets.shape

(8, 3)

In [9]:
# Use columns dataframe attribute to check column names
planets.columns

Index(['Planet', 'radius_km', 'moons'], dtype='object')

Python lets you define your own classes, each with their own special attributes and methods.

For example, suppose we want to build a Spaceship class to be reused later. A class is like a blueprint for all things that share characteristics and behaviors. In this case, the class is Spaceship. There can be all different kinds of spaceships. They can have different names and different purposes. Whenever you create an object of a given class, you’re creating an instance of that class.

In [10]:
class Spaceship:
    
    # class attribute
    tractor_beam = 'off'
    
    # class constructor--called whenever a new instance of the class is created
    def __init__(self, name, kind):
        self.name = name
        self.kind = kind
        self.speed = None
    
    # instance methods
    def warp(self, warp):
        self.speed = warp
        print(f'Warp {warp}, engage!')
        
    def tractor(self):
        if self.tractor_beam == 'off':
            self.tractor_beam = 'on'
            print('Tractor beam on.')
        else:
            self.tractor_beam = 'off'
            print('Tractor beam off.')

To create an instance of a `Spaceship`, we need to supply a name and kind. Then, we can use the methods and attributes of the instance.

In [11]:
# Create an instance of the Spaceship class (i.e. "instantiate")
ship = Spaceship('Mockingbird','rescue frigate')

# Check ship's name
print(ship.name)

# Check what kind of ship it is
print(ship.kind)

# Check tractor beam status
print(ship.tractor_beam)

# Set warp speed
ship.warp(7)

# Check speed
ship.speed

# Toggle tractor beam
ship.tractor()

# Check tractor beam status
print(ship.tractor_beam)

Mockingbird
rescue frigate
off
Warp 7, engage!
Tractor beam on.
on


## Variables and data types

Variables can store values of any data type. A data type is an attribute that describes a piece of data based on its values, its programming language, or the operations it can perform.

*Assignment* means the process of storing a value in a variable. An *expression* is a combination of numbers, symbols, or other variables that produce a result when evaluated.

Python is *dynamically-typed*. This means that variables can point to objects of any type.

*Naming restrictions* are rules built into the language that must be followed. When naming variables, programmers must adhere to all naming restrictions.

- *Keywords* must be avoided when naming variables. Keywords are special words that are reserved for a specific purpose and can only be used for that purpose (e.g., `for`, `in`, `if`, `else`).
- Avoid using function names in variables (e.g., `str`, `print`).
- Only include letters, numbers, and underscores. You cannot use special characters or whitespace. Variable names must start with a letter or an underscore.

Variable names are case-sensitive.
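
The restrictions above can be checked programmatically. The following is a minimal sketch that uses the standard-library `keyword` module and the `str.isidentifier()` method; the candidate names are made up for illustration. Note that `isidentifier()` only checks the character rules, so a keyword such as `for` passes that check and needs the separate `keyword.iskeyword()` test.

```python
import keyword

# Hypothetical candidate names, chosen to exercise each restriction
candidates = ['max_age', '_private', '2nd_place', 'max age', 'for', 'Max_Age']

results = {}
for name in candidates:
    # isidentifier() checks the character rules;
    # iskeyword() checks whether the name is a reserved word
    results[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(f'{name!r}: {results[name]}')

# Names that differ only by case refer to different variables
max_age = 34
Max_Age = 99
print(max_age, Max_Age)
```

Names that start with a digit, contain whitespace, or are keywords are rejected; the rest are valid.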

There are some best-practice naming conventions to help make code readable and maintainable:
- Descriptive names are better than cryptic abbreviations because they help other programmers (and you) read and interpret your code.
- Variable names and function names should be written in snake_case, which means that all letters are lowercase and words are separated using an underscore. 

See [PEP 8 Style Guide for Python](https://peps.python.org/pep-0008/) to review other style tips.

In [12]:
# Assign a list containing players' ages
age_list = [34, 25, 23, 19, 29]

In [13]:
# Find the maximum age and assign to `max_age` variable
max_age = max(age_list)
max_age

34

In [14]:
# Convert `max_age` to a string
max_age = str(max_age)
max_age

'34'

In [15]:
# Reassign the value of `max_age`
max_age = 'ninety-nine'
max_age

'ninety-nine'

In [16]:
# Find the maximum age and assign to `max_age` variable
max_age = max(age_list)
# Find the minimum age and assign to `min_age` variable
min_age = min(age_list)

# Subtract `min_age` from `max_age`
max_age - min_age

15

Python is able to manipulate variables using operations in expressions.

In [17]:
# Addition of 2 ints
print(7+8)

15


Similar to integers, Python can add (i.e., concatenate) two strings together.

In [18]:
# Addition of 2 strings
print("hello " + "world")

hello world


However, Python cannot add a string and an integer together.

In [19]:
# You cannot add a string to an integer
print(7+"8")

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The built-in `type` function can be used to determine the type of a variable.

In [20]:
# The type() function checks the data type of an object
type("A")

str

In [21]:
# The type() function checks the data type of an object
type(2)

int

In [22]:
# The type() function checks the data type of an object
type(2.5)

float

Python will implicitly convert the result of an expression to the appropriate data type. In this example, the result of adding an integer and a float together is a float.

In [23]:
# Implicit conversion
print(1 + 2.5)

3.5


Programmers can also explicitly convert types to another type. In this example, the result of `2+2` is being converted into a string.

In [24]:
# Explicit conversion (the str() function converts a number to a string)
print("2 + 2 = " + str(2 + 2))

2 + 2 = 4


## Functions

Functions and methods are very similar, but there are a few key differences. Methods are a specific type of function. They are functions that belong to a class. 
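
As a small illustration of the difference, the sketch below calls a built-in function and a string method on the same object. The variable names are assumptions made for this example.

```python
word = 'python'

# len() is a built-in function: the object is passed in as an argument
length = len(word)

# upper() is a string method: it is called on the object using dot notation
shouted = word.upper()

# A method can also be called through its class, which shows that it is
# a function that receives the object as its first argument
also_shouted = str.upper(word)

print(length, shouted, also_shouted)
```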

To learn more about functions, check out the [Functions reference guide](./content/Functions.pdf) in the content directory.

There are many functions built into Python, like `print()` and `type()`.

In [25]:
# The print() function can print text to the screen
print('Black dove, where will you go?')

Black dove, where will you go?


In [26]:
# The type() function returns an object's data type
number = 15

type(number)

int

In [27]:
# The str() function converts an object into a string
number = str(number)

type(number)

str

You can also define custom functions that accept parameters--like the function below called `greeting()`.

In [28]:
# Define a function
def greeting(name):

    print('Welcome, ' + name + '!')
    print('You are part of the team!')

greeting('Rebecca')

Welcome, Rebecca!
You are part of the team!


It is best practice to define a function for code that may need to be repeated many times, like calculating the area of a triangle. By defining the logic inside a function, it can be used to calculate the area of many different triangles.

In [29]:
# Define a function to calculate area of triangle
def area_triangle(base, height):
    return base * height / 2

In [30]:
# Use the function to assign new variables and perform calculations
area_a = area_triangle(5, 4)
area_b = area_triangle(7, 3)
total_area = area_a + area_b
total_area

20.5

In [31]:
# Define a function that converts hours, minutes, and seconds to total seconds
def get_seconds(hours, minutes, seconds):
    total_seconds = 3600*hours + 60*minutes + seconds
    return total_seconds

In [32]:
# Use the function to return a result
get_seconds(16, 45, 20)

60320

## Docstrings

Docstrings describe what a function does, what parameters it takes, and what it returns. They are entered at the top of a function body as a multi-line string enclosed in triple quotes.

In [33]:
def seed_calculator(fountain_side, grass_width):
    """
    Calculate number of kilograms of grass seed needed for
    a border around a square fountain.

        Parameters:
            fountain_side (num): length of 1 side of fountain in meters
            grass_width (num): width of grass border in meters

        Returns:
            seed (float): amount of seed (kg) needed for grass border
    """
    # Area of fountain
    fountain_area = fountain_side**2
    # Total area
    total_area = (fountain_side + 2 * grass_width)**2
    # Area of grass border
    grass_area = total_area - fountain_area
    # Amount of seed needed (35 g/sq.m)
    seed = grass_area * 35
    # Convert to kg
    seed = seed / 1000

    return seed

In [34]:
seed_calculator(12, 2)

3.92
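
A docstring is stored on the function object, so it can be read back at run time. The sketch below uses a hypothetical `area_rectangle()` function (not part of the original notebook) to show the `__doc__` attribute and the standard-library `inspect.getdoc()` helper; the built-in `help()` function displays the same text interactively.

```python
import inspect

def area_rectangle(width, height):  # hypothetical example function
    """
    Calculate the area of a rectangle.

        Parameters:
            width (num): width of rectangle in meters
            height (num): height of rectangle in meters

        Returns:
            area (num): area of rectangle in square meters
    """
    return width * height

# The raw docstring is stored on the function's __doc__ attribute
raw_doc = area_rectangle.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up
clean_doc = inspect.getdoc(area_rectangle)
print(clean_doc)
```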

## Comparators

Python has several built-in comparators:

| Operation | Operator |
| --- | --- |
| greater than | > |
| greater than or equal to | >= |
| less than | < |
| less than or equal to | <= |
| not equal to | != |
| equal to | == |

If you try to compare data types that aren’t compatible, like checking if a string is greater than an integer, Python will throw a `TypeError`. 

In [35]:
# > checks for greater than
print(10>1)

True


In [36]:
# == checks for equality
print("cat" == "dog")

False


In [37]:
# != checks for inequality
print(1 != 2)

True


In [38]:
# Some operators cannot be used between different data types
print(1 < "1")

TypeError: '<' not supported between instances of 'int' and 'str'

Python also contains several logical operators to create expressions:
    
| Operator | Description |
| --- | --- | 
| `and` | evaluates to True only if both statements are true |
| `or` | evaluates to True if at least one statement is true |
| `not` | reverses the evaluation |

In [39]:
# Letters that occur earlier in the alphabet evaluate to less than letters from later in the alphabet
# BOTH sides of an `and` statement must be true to return True
print("Yellow" > "Cyan" and "Brown" > "Magenta")

False


In [40]:
# An `or` statement will return True if EITHER side evaluates to True
print(25 > 50 or 1 != 2)

True


In [41]:
# `not` reverses Boolean evaluation of what follows it
print(not 42 == "Answer")

True


## Conditional Statements

In [42]:
# Define a function that checks validity of username based on length
def hint_username(username):
    if len(username) < 8:
        print("Invalid username. Must be at least 8 characters long.")
    else:
        print("Valid username.")

In [43]:
# Define a function that uses modulo to check if a number is even
def is_even(number):
    if number % 2 == 0:
        return True
    return False

In [44]:
is_even(19)

False

In [45]:
is_even(20)

True

In [46]:
# Define a function that checks validity of username based on length
def hint_username(username):
    if len(username) < 8:
        print("Invalid username. Must be at least 8 characters long.")
    elif len(username) > 15:
        print("Invalid username. Cannot exceed 15 characters.")
    else:
        print("Valid username.")

In [47]:
hint_username("ljñkljfñklasdjflkñadjglk{a")

Invalid username. Cannot exceed 15 characters.


## Loops

### While Loops

In [48]:
# Instantiate a counter
x = 0

# Create a while loop that prints "not there yet", increments x by 1,
# and prints x until x reaches 5
while x < 5:
    print('Not there yet, x=' + str(x))
    x = x + 1
    print('x=' + str(x))

Not there yet, x=0
x=1
Not there yet, x=1
x=2
Not there yet, x=2
x=3
Not there yet, x=3
x=4
Not there yet, x=4
x=5


In [49]:
# Import the random module to be able to create a (pseudo)random number
import random

number = random.randint(1,25)                   # Generate random number
number_of_guesses = 0                           # Instantiate guess counter

while number_of_guesses < 5:
    print('Guess a number between 1 and 25: ')  # Tells user to guess number
    guess = input()                             # Produces the user input field
    guess = int(guess)                          # Convert guess to integer
    number_of_guesses += 1                      # Increment guess count by 1

    if guess == number:                         # Break while loop if guess is correct
        break
    elif number_of_guesses == 5:                # Break while loop if guess limit reached
        break
    else:                                       # Tell user to try again
        print('Nope! Try again.')

# Message to display if correct
if guess == number:
    print('Correct! You guessed the number in ' + str(number_of_guesses) + ' tries!')
# Message to display after 5 unsuccessful guesses
else:
    print('You did not guess the number. The number was ' + str(number) + '.')

Guess a number between 1 and 25: 
 5
Nope! Try again.
Guess a number between 1 and 25: 
 20
Nope! Try again.
Guess a number between 1 and 25: 
 10
Nope! Try again.
Guess a number between 1 and 25: 
 12
Nope! Try again.
Guess a number between 1 and 25: 
 16
You did not guess the number. The number was 17.


### For Loops

The `range()` function takes up to three arguments: `start`, `stop`, and `step`. Its output is an object belonging to the range class. If you only include one argument, it will be interpreted as the stop value. The start and step values default to zero and one, respectively.
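
The defaults described above can be inspected directly, because a range object exposes `start`, `stop`, and `step` attributes. The short sketch below also converts a few ranges to lists so the generated values are visible; the variable names are chosen only for this example.

```python
r = range(5)

# A range is its own type, not a list
print(type(r))

# With one argument, start defaults to 0 and step defaults to 1
print(r.start, r.stop, r.step)

# Converting to a list shows the values each range produces
one_arg = list(range(5))
two_args = list(range(2, 5))
three_args = list(range(10, 20, 2))
print(one_arg, two_args, three_args)
```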

In [50]:
# Example of for loop with range() function
for x in range(5):
    print(x)

0
1
2
3
4


This `range()` example starts at 10, stops at 20, and steps by 2. The `stop` value in `range()` is exclusive, which means that it stops before it reaches the value.

In [51]:
for i in range(10, 20, 2):
    print(i)

10
12
14
16
18


In [52]:
# Example of reading in a .txt file line by line with a for loop
with open('zen_of_python.txt') as f:
    for line in f:
        print(line)
print('\nI\'m done.')

FileNotFoundError: [Errno 2] No such file or directory: 'zen_of_python.txt'

In [53]:
# Use a for loop to calculate 9!
product = 1
for n in range(1, 10):
    product = product * n

print(product)

362880


In [54]:
# Define a function that converts Fahrenheit to Celsius
def to_celsius(x):
    return (x-32) * 5/9

# Create a table of Fahrenheit-->Celsius conversions every 10 degrees, 0-100
for x in range(0, 101, 10):
    print(x, to_celsius(x))

0 -17.77777777777778
10 -12.222222222222221
20 -6.666666666666667
30 -1.1111111111111112
40 4.444444444444445
50 10.0
60 15.555555555555555
70 21.11111111111111
80 26.666666666666668
90 32.22222222222222
100 37.77777777777778


## Strings

In [55]:
# Adding strings will combine them
'Hello' + 'world'

'Helloworld'

In [56]:
# Blank space ("whitespace") is its own character
'Hello ' + 'world'

'Hello world'

In [57]:
# Including a whitespace when combining strings
'Hello' + ' ' + 'world'

'Hello world'

In [58]:
# Variables containing strings can be added
greeting_1 = 'Hello '
greeting_2 = 'world'
greeting_1 + greeting_2

'Hello world'

In [59]:
# Strings can be multiplied by integers
danger = 'Danger! '
danger * 3

'Danger! Danger! Danger! '

In [60]:
# Strings cannot be used with subtraction or division
danger - 2

TypeError: unsupported operand type(s) for -: 'str' and 'int'

In [61]:
# Alternate single and double quotes to include one or the other in your string
quote = '"Thank you for pressing the self-destruct button."'
print(quote)

"Thank you for pressing the self-destruct button."


In [62]:
# \ is an escape character that modifies the character that follows it
quote = "\"It's dangerous to go alone!\""
print(quote)

"It's dangerous to go alone!"


In [63]:
# \n creates a newline
greeting = "Good day,\nsir."
print(greeting)

Good day,
sir.


In [64]:
# Using escape character (\) lets you express the newline symbol within a string
newline = "\\n represents a newline in Python."
print(newline)

\n represents a newline in Python.


In [65]:
# You can loop over strings
python = 'Python'
for letter in python:
    print(letter + 'ut')

Put
yut
tut
hut
out
nut


## Working with Strings

### Indexing

**Indexing** refers to accessing a single element of a sequence by its position. In Python, the first element of any sequence has an index of zero. This means Python uses *zero-based indexing*. 

In [66]:
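# Define the string before passing it to the function
pets = 'cats and dogs'

def find_index_value(input_value, index_num):
    print(f'Character at index position {index_num} in {input_value} is {input_value[index_num]}')

find_index_value(pets, 10)
find_index_value(pets, 4)
find_index_value(pets, 0)

Character at index position 10 in cats and dogs is o
Character at index position 4 in cats and dogs is  
Character at index position 0 in cats and dogs is c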

The string class has an `index()` method that returns the position of the input in the string, if it exists.

In [67]:
# The index() method returns index of character's first occurrence in string
pets = 'cats and dogs'
pets.index('s')

3

In [68]:
# The index() method will throw an error if character is not in string
pets.index('z')

ValueError: substring not found

In [69]:
# Access the character at a given index of a string
name = 'Jolene'
name[0]

'J'

In [70]:
# Access the character at a given index of a string
name[5]

'e'

In [71]:
# Indices that are out of range will raise an IndexError
name[6]

IndexError: string index out of range

Strings can also be indexed starting at the end--called *negative indexing*.

In [72]:
# Negative indexing begins at the end of the string
sentence = 'A man, a plan, a canal, Panama!'
sentence[-1]

'!'

In [73]:
# Negative indexing begins at the end of the string
sentence[-2]

'a'

### Slicing

**Slicing** refers to accessing a range of elements from a sequence. Use square brackets containing two indices separated by a colon. 

E.g., 
- From beginning: `[:5]`  
- Until end: `[4:]`  
- Full string: `[:]`  

In [74]:
# Access a substring by using a slice
color = 'orange'
color[1:4]

'ran'

In [75]:
# Omitting the first value of the slice implies a value of 0
fruit = 'pineapple'
print(fruit[:4])

# Omitting the last value of the slice implies a value of len(string)
print(fruit[4:])

# Omitting both values results in the original string
print(fruit[:])

pine
apple
pineapple


In [76]:
# The `in` keyword returns Boolean of whether substring is in string
print('banana' in fruit)

print('apple' in fruit)

False
True


### Formatting

String formatting uses the `format()` method, which belongs to the string class. This method formats and inserts specific substrings into designated places within a larger string.

In [77]:
# Use format() method to insert values into your string, indicated by braces
name = 'Manuel'
number = 3
print('Hello {}, your lucky number is {}.'.format(name, number))

Hello Manuel, your lucky number is 3.


In [78]:
# You can assign names to designate how you want values to be inserted
name = 'Manuel'
number = 3
print('Hello {name}, your lucky number is {num}.'.format(num=number, name=name))

Hello Manuel, your lucky number is 3.


In [79]:
# You can use argument indices to designate how you want values to be inserted
print('Hello {1}, your lucky number is {0}.'.format(number, name))

Hello Manuel, your lucky number is 3.


In [80]:
# Example inserting prices into string
price = 7.75
with_tax = price * 1.07
print('Base price: ${} USD. \nWith tax: ${} USD.'.format(price, with_tax))

Base price: $7.75 USD. 
With tax: $8.2925 USD.


In [81]:
# Use :.2f to round a float value to two places beyond the decimal
print('Base price: ${:.2f} USD. \nWith tax: ${:.2f} USD.'.format(price, with_tax))

Base price: $7.75 USD. 
With tax: $8.29 USD.


In [82]:
# Define a function that converts Fahrenheit to Celsius
def to_celsius(x):
    return (x-32) * 5/9

# Create a temperature conversion table using string formatting
for x in range(0, 101, 10):
    print("{:>3} F | {:>6.2f} C".format(x, to_celsius(x)))

  0 F | -17.78 C
 10 F | -12.22 C
 20 F |  -6.67 C
 30 F |  -1.11 C
 40 F |   4.44 C
 50 F |  10.00 C
 60 F |  15.56 C
 70 F |  21.11 C
 80 F |  26.67 C
 90 F |  32.22 C
100 F |  37.78 C


**F-strings** further minimize the syntax required to embed expressions into strings. They’re called f-strings because the string literal always begins with the prefix `f` (or `F`; the two are equivalent), placed before the opening quote.

In [83]:
var_a = 1
var_b = 2
print(f'{var_a} + {var_b}')
print(f'{var_a + var_b}')
print(f'var_a = {var_a} \nvar_b = {var_b}')

1 + 2
3
var_a = 1 
var_b = 2


In addition to inserting expressions into strings, string formatting can format their appearance. 

For all available options, review [Python string documentation](https://docs.python.org/3/library/string.html).

Formatting floats within a string is a common task for data professionals. This can be done using the following format:

`{float:.2f}`

where `float` is the float to be formatted, `.2` represents the precision (number of decimal places), and `f` is the presentation type.

In [84]:
num = 1000.987123
f'{num:.2f}'

'1000.99'

Some common examples of presentation types include:

| Type | Meaning |
| --- | --- |
| `e` | Scientific notation |
| `f` | Fixed-point notation |
| `%` | Percentage |

In [85]:
num = 1000.987123
print(f'{num:.3e}')

decimal = 0.2497856
print(f'{decimal:.4%}')

1.001e+03
24.9786%


Regular expressions, also known as *regex*, are patterns that describe sets of strings. Advanced data professionals use them to search, modify, and process string data.

Regex works by matching a pattern against text, which lets you search for specific patterns within a string. It is used extensively in web scraping, text processing and cleaning, and data analysis.

The first step in working with regular expressions is to import the re module. This module provides the tools necessary for working with regular expressions.

In [86]:
# regex is available in the `re` module of the standard library
import re

# specify the regex pattern
pattern = 'tigers'

string_to_search = 'Three sad tigers swallowed wheat in a wheat field'

# search using the `search()` function
re.search(pattern, string_to_search)

<re.Match object; span=(10, 16), match='tigers'>
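
The match object returned by `re.search()` carries the matched text and its position. The sketch below, which reuses the same sentence, shows the `group()` and `span()` methods, the `None` result for a pattern that is not found, and `re.findall()` for collecting every match. The `'lions'` and `'wheat'` patterns are additional examples, not part of the original cell.

```python
import re

string_to_search = 'Three sad tigers swallowed wheat in a wheat field'

# search() returns a match object for the first occurrence of the pattern
match = re.search('tigers', string_to_search)
if match is not None:
    print(match.group())  # the text that matched
    print(match.span())   # (start, end) indices; end is exclusive

# search() returns None when the pattern does not occur
no_match = re.search('lions', string_to_search)
print(no_match)

# findall() returns a list of every non-overlapping match
all_wheat = re.findall('wheat', string_to_search)
print(all_wheat)
```

Checking for `None` before calling methods on the result avoids an `AttributeError` when nothing matches.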

## Data Structures

### Lists

Lists are a core Python class and have a number of built-in methods that are very useful.
- `append()`: Adds an element to the end of a list
- `insert()`: Adds an element at the position specified
- `remove()`: Removes the first occurrence of an item
- `pop()`: Removes an item at the given position in a list and returns it.
- `clear()`: Removes all items from the list.
- `index()`: Returns the index of the first occurrence of an item in the list.
- `count()`: Returns the number of times an item appears in the list.
- `sort()`: Sorts the list ascending by default.

To learn more about lists, review [Python lists](https://docs.python.org/3/tutorial/introduction.html#lists).
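
Four of the methods listed above, `index()`, `count()`, `sort()`, and `clear()`, can be sketched on a small list. The `scores` list and the other variable names are made up for this example.

```python
scores = [72, 95, 88, 95, 60]

# index() returns the position of the first occurrence of a value
first_95 = scores.index(95)

# count() returns how many times a value appears
count_95 = scores.count(95)

# sort() sorts the list in place (ascending by default) and returns None
scores.sort()
ascending = list(scores)   # copy the sorted state for later comparison

# reverse=True sorts in descending order
scores.sort(reverse=True)
descending = list(scores)

# clear() removes every item, leaving an empty list
scores.clear()

print(first_95, count_95, ascending, descending, scores)
```

Because `sort()` and `clear()` modify the list in place, the copies made with `list()` are what preserve the intermediate states.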

In [87]:
# Assign a list using brackets, with elements separated by commas
x = ["Now", "we", "are", "cooking", "with", 7, "ingredients"]

# Print element at index 3
print(x[3])

cooking


In [88]:
# Trying to access an index not in list will result in IndexError
print(x[7])

IndexError: list index out of range

In [89]:
# Access part of a list by slicing
x[1:3]

['we', 'are']

In [90]:
# Omitting the first value of the slice implies a value of 0
x[:2]

['Now', 'we']

In [91]:
# Omitting the last value of the slice implies a value of len(list)
x[2:]

['are', 'cooking', 'with', 7, 'ingredients']

In [92]:
# Check the data type of an object using type() function
type(x)

list

In [94]:
# The `in` keyword lets you check if a value is contained in the list
x = ["Now", "we", "are", "cooking", "with", 7, "ingredients"]

# Define the helper function before calling it
def check_for_entry_in_list(l, w):
    print(f"Checking if '{w}' is in the list... {w in l}")

check_for_entry_in_list(x, "This")
check_for_entry_in_list(x, "cooking")

Checking if 'This' is in the list... False
Checking if 'cooking' is in the list... True


In [95]:
# The append() method adds an element to the end of a list
fruits = ['Pineapple', 'Banana', 'Apple', 'Melon']
fruits.append('Kiwi')
print(fruits)

['Pineapple', 'Banana', 'Apple', 'Melon', 'Kiwi']


In [96]:
# The insert() method adds an element to a list at the specified index
fruits.insert(1, 'Orange')
print(fruits)

['Pineapple', 'Orange', 'Banana', 'Apple', 'Melon', 'Kiwi']


In [97]:
# The insert() method adds an element to a list at the specified index
fruits.insert(0, 'Mango')
print(fruits)

['Mango', 'Pineapple', 'Orange', 'Banana', 'Apple', 'Melon', 'Kiwi']


In [98]:
# The remove() method deletes the first occurrence of an element in a list
fruits.remove('Banana')
print(fruits)

['Mango', 'Pineapple', 'Orange', 'Apple', 'Melon', 'Kiwi']


In [99]:
# Trying to remove an element that doesn't exist results in an error
fruits.remove('Strawberry')
print(fruits)

ValueError: list.remove(x): x not in list

In [100]:
# The pop() method removes the element at a given index and returns it.
# If no index is given, it removes and returns the last element.
fruits.pop(2)
print(fruits)

['Mango', 'Pineapple', 'Apple', 'Melon', 'Kiwi']


In [101]:
# Reassign the element at a given index with a new value
fruits[1] = 'Mango'

In [102]:
print(fruits)

['Mango', 'Mango', 'Apple', 'Melon', 'Kiwi']


In [103]:
# Strings are immutable: to "modify" one, you build a new string and reassign it
power = '1.21'
power = power + ' gigawatts'
print(power)

1.21 gigawatts


In [104]:
# You cannot reassign a specific character within a string
power[0] = '2'

TypeError: 'str' object does not support item assignment

In [105]:
# Lists are mutable because you can overwrite their elements
power = [1.21, 'gigawatts']
power[0] = 2.21
print(power)

[2.21, 'gigawatts']


#### List Comprehension

List comprehension offers a shorter syntax when you want to create a new list based on the values of an existing list.

`newlist = [expression for item in iterable if condition]`

In [157]:
# Uses a `for` loop
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
newlist = []

for x in fruits:
    if "a" in x:
        newlist.append(x)

print(newlist)

['apple', 'banana', 'mango']


In [160]:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]

# Creates a list of fruits that have the letter `a` in them
newlist = [x for x in fruits if "a" in x]
print(newlist)

# Concatenates `123` onto the end of each fruit in the list
newlist = [x + '123' for x in fruits]
print(newlist)

['apple', 'banana', 'mango']
['apple123', 'banana123', 'cherry123', 'kiwi123', 'mango123']


### Tuples

In [107]:
# Tuples are instantiated with parentheses
fullname = ('Masha', 'Z', 'Hopper')

fullname

('Masha', 'Z', 'Hopper')

In [108]:
# Tuples are immutable, so their elements cannot be overwritten
fullname[2] = 'Copper'
print(fullname)

TypeError: 'tuple' object does not support item assignment

In [109]:
# You can combine tuples using addition
fullname = fullname + ('Jr',)
print(fullname)

('Masha', 'Z', 'Hopper', 'Jr')


In [110]:
# The tuple() function converts an object's data type to tuple
fullname = ['Masha', 'Z', 'Hopper']
fullname = tuple(fullname)
print(fullname)

('Masha', 'Z', 'Hopper')


In [111]:
# Functions that return multiple values return them in a tuple
def to_dollars_cents(price):
    '''
    Split price (float) into dollars and cents.
    '''
    dollars = int(price // 1)
    cents = round(price % 1 * 100)

    return dollars, cents

In [112]:
# Functions that return multiple values return them in a tuple
to_dollars_cents(6.55)

(6, 55)

In [113]:
# "Unpacking" a tuple allows a tuple's elements to be assigned to variables
dollars, cents = to_dollars_cents(6.55)
print(dollars + 1)
print(cents + 1)

7
56


In [114]:
# The data type of an element of an unpacked tuple is not necessarily a tuple
type(dollars)

int
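
One caveat about `to_dollars_cents()`: because the cents are rounded separately from the dollars, a price just below a whole dollar can produce 100 cents. The sketch below reproduces that edge case and shows one possible alternative, assuming it is acceptable to round the whole price to the nearest cent first. The name `to_dollars_cents_safe` is hypothetical and not part of the original notes.

```python
def to_dollars_cents(price):
    """Original approach: compute dollars and cents separately."""
    dollars = int(price // 1)
    cents = round(price % 1 * 100)
    return dollars, cents


def to_dollars_cents_safe(price):
    """Hypothetical variant: round to whole cents first, then split."""
    total_cents = round(price * 100)
    # divmod() returns (quotient, remainder) as a tuple
    return divmod(total_cents, 100)


edge_original = to_dollars_cents(6.999)
edge_safe = to_dollars_cents_safe(6.999)
print(edge_original)  # the cents value reaches 100
print(edge_safe)      # the extra cent carries over into the dollars
```

Both functions still return a tuple, so unpacking with `dollars, cents = ...` works the same way.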

In [115]:
# Create a list of tuples, each representing the name, age, and position of a
# player on a basketball team
team = [('Marta', 20, 'center'),
        ('Ana', 22, 'point guard'),
        ('Gabi', 22, 'shooting guard'),
        ('Luz', 21, 'power forward'),
        ('Lorena', 19, 'small forward'),
        ]

In [116]:
# Use a for loop to loop over the list, unpack the tuple at each iteration, and
# print one of the values
for name, age, position in team:
    print(name)

Marta
Ana
Gabi
Luz
Lorena


In [117]:
# This code produces the same result as the code in the cell above
for player in team:
    print(player[0])

Marta
Ana
Gabi
Luz
Lorena


In [118]:
# Create a function to extract names and positions from the team list and
# format them to be printed. Returns a list.
def player_position(players):
    result = []
    for name, age, position in players:
        result.append('Name: {:>19} \nPosition: {:>15}\n'.format(name, position))

    return result

In [119]:
# Loop over the list of formatted names and positions produced by
# player_position() function and print them
for player in player_position(team):
    print(player)

Name:               Marta 
Position:          center

Name:                 Ana 
Position:     point guard

Name:                Gabi 
Position:  shooting guard

Name:                 Luz 
Position:   power forward

Name:              Lorena 
Position:   small forward



In [120]:
# Nested loops can produce the different combinations of pips (dots) in
# a set of dominoes
for left in range(7):
    for right in range(left, 7):
        print(f"[{left}|{right}]", end=" ")
    print('\n')

[0|0] [0|1] [0|2] [0|3] [0|4] [0|5] [0|6] 

[1|1] [1|2] [1|3] [1|4] [1|5] [1|6] 

[2|2] [2|3] [2|4] [2|5] [2|6] 

[3|3] [3|4] [3|5] [3|6] 

[4|4] [4|5] [4|6] 

[5|5] [5|6] 

[6|6] 



In [121]:
# Create a list of dominoes, with each domino represented as a tuple
dominoes = []
for left in range(7):
    for right in range(left, 7):
        dominoes.append((left, right))
dominoes

[(0, 0),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (0, 5),
 (0, 6),
 (1, 1),
 (1, 2),
 (1, 3),
 (1, 4),
 (1, 5),
 (1, 6),
 (2, 2),
 (2, 3),
 (2, 4),
 (2, 5),
 (2, 6),
 (3, 3),
 (3, 4),
 (3, 5),
 (3, 6),
 (4, 4),
 (4, 5),
 (4, 6),
 (5, 5),
 (5, 6),
 (6, 6)]

In [122]:
# Select index 1 of the tuple at index 4 in the list of dominoes
dominoes[4][1]

4

In [123]:
# You can use a for loop to sum the pips on each domino and append
# the sum to a new list
pips_from_loop = []
for domino in dominoes:
    pips_from_loop.append(domino[0] + domino[1])
print(pips_from_loop)

[0, 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7, 8, 6, 7, 8, 9, 8, 9, 10, 10, 11, 12]


In [124]:
# A list comprehension produces the same result with less code
pips_from_list_comp = [domino[0] + domino[1] for domino in dominoes]
pips_from_loop == pips_from_list_comp

True
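
The nested loops that build the list of dominoes can also be written as a single list comprehension with two `for` clauses. This is a minimal sketch of that idea; the variable name `pips` is chosen here for illustration.

```python
# Build the list of dominoes with one nested list comprehension.
# The `for` clauses are read left to right, like the nested loops.
dominoes = [(left, right) for left in range(7) for right in range(left, 7)]

# Tuple unpacking inside the comprehension avoids indexing with [0] and [1]
pips = [left + right for left, right in dominoes]

print(len(dominoes))
print(pips[:7])
```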

### Iterables

Python's core iterable sequence data structures include strings, lists, and tuples (among others). The table below summarizes the differences between these data types.

| Data Type | Syntax | Content | Mutability | Usage |
| --- | --- | --- | --- | --- |
| String | `"String"` | Strings can contain any character--letters, numbers, punctuation marks, spaces--but everything between the opening and closing quotation marks is part of the same simple string. | Immutable: any operation that appears to modify a string actually creates a new string. | Commonly used to represent text data. |
| List | `[1, 2, 3, 4]` | Lists can contain any data type, in any combination. So a single list can contain strings, integers, floats, tuples, dictionaries, and other lists. | Mutable | Storing collections of related data <br> Storing collections of items that you want to iterate over <br> Storing collections of items that must be sorted or searched |
| Tuple | `(1, 2, 3)` <br> `tuple()` | Tuples can contain any kind of data and in any combination, including strings, integers, floats, lists, dicts, and other tuples. | Immutable | Returning multiple values from a function <br> Packing and unpacking sequences <br> Dictionary keys <br> Data integrity (immutability) |

The `zip()` function can be used to combine multiple iterables together.

`zip(*iterables)`

In [156]:
# List of states
state = ['Pennsylvania', 'California', 'New York']

# List of counties
counties = ['Chester', 'Orange', 'New York']

# `zip()` combines multiple iterables into an iterable of tuples in the form of (list1[0], list2[0], listn[0])
state_counties = zip(state, counties)

# results in an iterable object (zip)
state_counties

# convert to a list
list(state_counties)

[('Pennsylvania', 'Chester'),
 ('California', 'Orange'),
 ('New York', 'New York')]
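
A few other behaviors of `zip()` are worth knowing. The sketch below reuses the state and county lists; the variable names `short`, `county_by_state`, and `pairs` are illustrative.

```python
states = ['Pennsylvania', 'California', 'New York']
counties = ['Chester', 'Orange', 'New York']

# zip() stops at the end of the shortest iterable
short = list(zip(states, ['Chester', 'Orange']))

# Passing zipped pairs to dict() builds a lookup table
county_by_state = dict(zip(states, counties))

# zip(*pairs) reverses the operation ("unzipping")
pairs = list(zip(states, counties))
unzipped_states, unzipped_counties = zip(*pairs)

print(short)
print(county_by_state['California'])
print(unzipped_states)
```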

### Dictionaries

Dictionaries are useful when you need a data structure to store information that can be referenced or looked up.

In [125]:
# Create a dictionary with pens as keys and the animals they contain as values.
# Dictionaries can be instantiated using braces.
zoo = {
    'pen_1': 'penguins',
    'pen_2': 'zebras',
    'pen_3': 'lions',
    }

# Selecting the `pen_2` key returns `zebras`--the value stored at that key
zoo['pen_2']

'zebras'

In [126]:
# You cannot access a dictionary's values by name using bracket indexing
# because the computer interprets this as a key, not a value
zoo['zebras']

KeyError: 'zebras'
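
When a key might be missing, or when you need to search by value instead of by key, the following sketch shows some options. It uses the `get()`, `values()`, and `items()` dictionary methods; the fallback string is an arbitrary choice.

```python
zoo = {'pen_1': 'penguins', 'pen_2': 'zebras', 'pen_3': 'lions'}

# get() returns a default instead of raising KeyError for a missing key
missing = zoo.get('zebras', 'no such pen')

# To search by value, check the values or loop over the key-value pairs
has_zebras = 'zebras' in zoo.values()
zebra_pens = [pen for pen, animal in zoo.items() if animal == 'zebras']

print(missing)
print(has_zebras)
print(zebra_pens)
```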

In [127]:
# Dictionaries can also be instantiated using the dict() function
zoo = dict(
    pen_1='monkeys',
    pen_2='zebras',
    pen_3='lions',
    )

zoo['pen_2']

'zebras'

In [128]:
# Another way to create a dictionary using the dict() function
zoo = dict(
    [
     ['pen_1', 'monkeys'],
     ['pen_2', 'zebras'],
     ['pen_3', 'lions'],
    ]
)

zoo['pen_2']

'zebras'

In [129]:
# Assign a new key:value pair to an existing dictionary
zoo['pen_4'] = 'crocodiles'
zoo

{'pen_1': 'monkeys',
 'pen_2': 'zebras',
 'pen_3': 'lions',
 'pen_4': 'crocodiles'}

In [130]:
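# Dictionaries do not support positional indexing. They preserve insertion
# order (Python 3.7+), but brackets always look up a key, so zoo[2] searches
# for a key equal to 2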
zoo[2]

KeyError: 2

In [131]:
# Use the `in` keyword to produce a Boolean of whether a given key exists in a dictionary
print('pen_1' in zoo)
print('pen_7' in zoo)

True
False


In [132]:
# Create a list of tuples, each representing the name, age, and position of a
# player on a basketball team
team = [
    ('Marta', 20, 'center'),
    ('Ana', 22, 'point guard'),
    ('Gabi', 22, 'shooting guard'),
    ('Luz', 21, 'power forward'),
    ('Lorena', 19, 'small forward'),
    ]

In [133]:
# Add new players to the list
team = [
    ('Marta', 20, 'center'),
    ('Ana', 22, 'point guard'),
    ('Gabi', 22, 'shooting guard'),
    ('Luz', 21, 'power forward'),
    ('Lorena', 19, 'small forward'),
    ('Sandra', 19, 'center'),
    ('Mari', 18, 'point guard'),
    ('Esme', 18, 'shooting guard'),
    ('Lin', 18, 'power forward'),
    ('Sol', 19, 'small forward'),
    ]

In [134]:
# Instantiate an empty dictionary
new_team = {}

# Loop over the tuples in the list of players and unpack their values
for name, age, position in team:
    if position in new_team:                    # If position already a key in new_team,
        new_team[position].append((name, age))  # append (name, age) tup to list at that value
    else:
        new_team[position] = [(name, age)]      # If position not a key in new_team,
                                                # create a new key whose value is a list
                                                # containing (name, age) tup
new_team

{'center': [('Marta', 20), ('Sandra', 19)],
 'point guard': [('Ana', 22), ('Mari', 18)],
 'shooting guard': [('Gabi', 22), ('Esme', 18)],
 'power forward': [('Luz', 21), ('Lin', 18)],
 'small forward': [('Lorena', 19), ('Sol', 19)]}
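
The "check whether the key exists, then append or create" pattern is common enough that Python offers shortcuts for it. The sketch below shows two alternatives, `collections.defaultdict` and `dict.setdefault()`, applied to a shortened, illustrative version of the team list.

```python
from collections import defaultdict

team = [
    ('Marta', 20, 'center'),
    ('Ana', 22, 'point guard'),
    ('Sandra', 19, 'center'),
    ('Mari', 18, 'point guard'),
]

# defaultdict(list) creates an empty list the first time a key is accessed,
# so no explicit `if position in ...` check is needed
by_position = defaultdict(list)
for name, age, position in team:
    by_position[position].append((name, age))

# setdefault() achieves the same result with a plain dictionary
by_position_plain = {}
for name, age, position in team:
    by_position_plain.setdefault(position, []).append((name, age))

print(dict(by_position))
```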

In [135]:
# Examine the value at the 'point guard' key
new_team['point guard']

[('Ana', 22), ('Mari', 18)]

In [136]:
# You can access a dictionary's keys by looping over them
for x in new_team:
    print(x)

center
point guard
shooting guard
power forward
small forward


In [137]:
# The keys() method returns the keys of a dictionary
new_team.keys()

dict_keys(['center', 'point guard', 'shooting guard', 'power forward', 'small forward'])

In [138]:
# The values() method returns all the values in a dictionary
new_team.values()

dict_values([[('Marta', 20), ('Sandra', 19)], [('Ana', 22), ('Mari', 18)], [('Gabi', 22), ('Esme', 18)], [('Luz', 21), ('Lin', 18)], [('Lorena', 19), ('Sol', 19)]])

In [139]:
# The items() method returns both the keys and the values
for a, b in new_team.items():
    print(a, b)

center [('Marta', 20), ('Sandra', 19)]
point guard [('Ana', 22), ('Mari', 18)]
shooting guard [('Gabi', 22), ('Esme', 18)]
power forward [('Luz', 21), ('Lin', 18)]
small forward [('Lorena', 19), ('Sol', 19)]


### Sets

In [140]:
# The set() function converts a list to a set
x = set(['foo', 'bar', 'baz', 'foo'])
print(x)

{'baz', 'foo', 'bar'}


In [141]:
# The set() function converts a tuple to a set
x = set(('foo','bar','baz', 'foo'))
print(x)

{'baz', 'foo', 'bar'}


In [142]:
# The set() function converts a string to a set
x = set('foo')
print(x)

{'f', 'o'}


In [143]:
# You can use braces to instantiate a set
x = {'foo'}
print(type(x))

# But empty braces are reserved for dictionaries
y = {}
print(type(y))

<class 'set'>
<class 'dict'>


In [144]:
# Instantiating a set with braces treats each item as a single element,
# so the string is not split into characters as it is with set('foo')
x = {'foo'}
print(x)

{'foo'}


In [145]:
# The intersection() method (&) returns common elements between two sets
set1 = {1, 2, 3, 4, 5, 6}
set2 = {4, 5, 6, 7, 8, 9}
print(set1.intersection(set2))
print(set1 & set2)

{4, 5, 6}
{4, 5, 6}


In [146]:
# The union() method (|) returns all the elements from two sets, each represented once
x1 = {'foo', 'bar', 'baz'}
x2 = {'baz', 'qux', 'quux'}
print(x1.union(x2))
print(x1 | x2)

{'qux', 'baz', 'quux', 'foo', 'bar'}
{'qux', 'baz', 'quux', 'foo', 'bar'}


In [147]:
# The difference() method (-) returns the elements in set1 that aren't in set2
set1 = {1, 2, 3, 4, 5, 6}
set2 = {4, 5, 6, 7, 8, 9}
print(set1.difference(set2))
print(set1 - set2)

{1, 2, 3}
{1, 2, 3}


In [148]:
# ... and the elements in set2 that aren't in set1
print(set2.difference(set1))
print(set2 - set1)

{8, 9, 7}
{8, 9, 7}


In [149]:
# The symmetric_difference() method (^) returns all the values from each set that
# are not in both sets.
set1 = {1, 2, 3, 4, 5, 6}
set2 = {4, 5, 6, 7, 8, 9}
set2.symmetric_difference(set1)
set2 ^ set1

{1, 2, 3, 7, 8, 9}

## NumPy

NumPy's power comes from *vectorization*, which enables operations to be performed on multiple components of a data object at the same time.

Suppose we wanted to create a new list that contains the element-wise product of two other lists, so that...

```
r[0] = l1[0] * l2[0]
r[1] = l1[1] * l2[1]
r[n] = l1[n] * l2[n]
```

In [8]:
list_a = [1, 2, 3]
list_b = [2, 4, 6]


# Expected result: [2, 8, 18]

Given the lists above, we could--naively--use a `for` loop to perform this calculation.

In [13]:
result = []
for i in range(len(list_a)):
    result.append(list_a[i] * list_b[i])
    
result

[2, 8, 18]

This approach works, but would perform poorly on very large lists.

NumPy can perform calculations like this on large data sets as a single vectorized operation, which runs in compiled code instead of a Python-level loop.

In [15]:
import numpy as np

# create arrays to represent the input lists
array_a = np.array(list_a)
array_b = np.array(list_b)

# Perform element-wise multiplication between the arrays
array_a * array_b

array([ 2,  8, 18])
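
To make the performance claim concrete, the sketch below times both approaches with the standard library's `timeit` module. The list size and the number of runs are arbitrary choices, and the timings depend on the machine, so no specific speedup is claimed here.

```python
import timeit

import numpy as np

# Illustrative size only; timings vary by machine
n = 100_000
list_a = list(range(n))
list_b = list(range(n))
# An explicit 64-bit dtype avoids overflow on platforms with 32-bit defaults
array_a = np.array(list_a, dtype=np.int64)
array_b = np.array(list_b, dtype=np.int64)


def multiply_with_loop():
    result = []
    for i in range(n):
        result.append(list_a[i] * list_b[i])
    return result


def multiply_with_numpy():
    return array_a * array_b


# Both approaches produce the same values
same = multiply_with_loop() == multiply_with_numpy().tolist()
print(same)

# Time 10 runs of each approach
loop_seconds = timeit.timeit(multiply_with_loop, number=10)
numpy_seconds = timeit.timeit(multiply_with_numpy, number=10)
print(f'loop: {loop_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s')
```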

### Array Operations

The N-dimensional array (`ndarray`) is the core data object of NumPy. A one-dimensional `ndarray` behaves like a vector, and arithmetic on it is applied element by element.

In [16]:
import numpy as np

# The np.array() function converts an object to an ndarray
x = np.array([1, 2, 3, 4])
x

array([1, 2, 3, 4])

Arrays are mutable, which means you can change the items in an array. However, arrays are not resizable, so you cannot add more items to an existing array.

In [17]:
# Arrays can be indexed
x[-1] = 5
x

array([1, 2, 3, 5])

In [18]:
# Trying to access an index that doesn't exist will throw an error
x[4] = 10

IndexError: index 4 is out of bounds for axis 0 with size 4
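
Because arrays are not resizable, functions that appear to add items build a new array instead. A brief sketch using `np.append()`, with the variable name `y` chosen for illustration:

```python
import numpy as np

x = np.array([1, 2, 3, 5])

# np.append() does not modify x; it returns a new, longer array
y = np.append(x, 10)

print(x)
print(y)
```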

All items in an array must be of the same data type. If items of different types are passed to an array, NumPy will do its best to convert all items to the same type.

In the next example, notice how NumPy converts all items to strings.

In [19]:
# Arrays cast every element they contain as the same data type
arr = np.array([1, 2, 'coconut'])
arr

array(['1', '2', 'coconut'], dtype='<U21')
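
The same coercion applies between numeric types, and the `astype()` method converts an array to a type you choose. The values below are illustrative.

```python
import numpy as np

# Integers mixed with a float are upcast to float
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# astype() returns a new array converted to the requested type.
# Converting float to int truncates toward zero.
as_int = mixed.astype(int)
print(as_int)

# Strings that look like integers can be converted back explicitly
numeric = np.array(['1', '2', '3']).astype(int)
print(numeric.sum())
```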

In [20]:
# NumPy arrays are a class called `ndarray`
print(type(arr))

<class 'numpy.ndarray'>


The `dtype` attribute is used to check the data type of the contents of an array.

In [21]:
# The dtype attribute returns the data type of an array's contents
arr = np.array([1, 2, 3])
arr.dtype

dtype('int64')

As the name `ndarray` implies, arrays can be multi-dimensional. A one-dimensional array has neither rows nor columns.

The `shape` attribute can be used to check the shape of an array.
The `ndim` attribute can be used to check the number of dimensions of an array.

In [22]:
# The shape attribute returns the number of elements in each dimension
# of an array
arr.shape

(3,)

In [23]:
# The ndim attribute returns the number of dimensions in an array
arr.ndim

1

In [24]:
# Create a 2D array by passing a list of lists to np.array() function
arr_2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print(arr_2d.shape)
print(arr_2d.ndim)
arr_2d

(4, 2)
2


array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [25]:
# Create a 3D array by passing a list of two lists of lists to np.array() function
arr_3d = np.array([[[1, 2, 3],
                   [3, 4, 5]],

                  [[5, 6, 7],
                   [7, 8, 9]]]
)

print(arr_3d.shape)
print(arr_3d.ndim)
arr_3d

(2, 2, 3)
3


array([[[1, 2, 3],
        [3, 4, 5]],

       [[5, 6, 7],
        [7, 8, 9]]])

NumPy allows developers to change the shape of an array using the `reshape()` method.

In [26]:
# The reshape() method changes the shape of an array
arr_2d = arr_2d.reshape(2, 4)
arr_2d

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
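
Two details of `reshape()` are easy to miss: one dimension can be given as `-1` so that NumPy infers it, and the new shape must account for every element. A small sketch, using `np.arange()` to build an illustrative array:

```python
import numpy as np

arr = np.arange(1, 9)  # the integers 1 through 8

# Passing -1 lets NumPy infer that dimension from the array's size
inferred = arr.reshape(4, -1)
print(inferred.shape)

# The new shape must use every element; 8 items cannot fill a 3 x 3 array
try:
    arr.reshape(3, 3)
except ValueError as error:
    print(f'ValueError: {error}')
```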

In [27]:
# Create new array
arr = np.array([1, 2, 3, 4, 5])

# The np.mean() function returns the mean of the elements in an array
np.mean(arr)

3.0

In [28]:
# The np.log() function returns the natural logarithm of each element in an array
np.log(arr)

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791])

In [29]:
# The np.floor() function returns the value of a number rounded down
# to the nearest integer
np.floor(5.7)

5.0

In [30]:
# The np.ceil() function returns the value of a number rounded up
# to the nearest integer
np.ceil(5.3)

6.0

## Pandas

In [1]:
# NumPy and pandas are typically imported together.
# np and pd are conventional aliases.
import numpy as np
import pandas as pd

In [2]:
# Read in data from a .csv file
dataframe = pd.read_csv('https://raw.githubusercontent.com/adacert/titanic/main/train.csv')

# Display the first 10 rows
dataframe.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [3]:
# Calculate the mean of the Age column
dataframe['Age'].mean()

29.69911764705882

In [4]:
# Calculate the maximum value contained in the Age column
dataframe['Age'].max()

80.0

In [5]:
# Calculate the minimum value contained in the Age column
dataframe['Age'].min()

0.42

In [6]:
# Calculate the standard deviation of the values in the Age column
dataframe['Age'].std()

14.526497332334042

In [7]:
# Return the number of rows that share the same value in the Pclass column
dataframe['Pclass'].value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [8]:
# The describe() method returns summary statistics of the dataframe
dataframe.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
# Filter the data to return only rows where value in Age column is greater than 60
# and value in Pclass column equals 3
dataframe[(dataframe['Age'] > 60) & (dataframe['Pclass'] == 3)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
280,281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
326,327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


In [10]:
# Create a new column called 2023_Fare that contains the inflation-adjusted
# fare of each ticket in 2023 pounds
dataframe['2023_Fare'] = dataframe['Fare'] * 146.14
dataframe

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2023_Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,1059.515000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1158.159500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,7760.034000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1176.427000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1899.820000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,4384.200000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,3426.983000
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,4384.200000


In [11]:
# Use iloc to access data by integer position.
# Select row 1, column 3.
dataframe.iloc[1, 3]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [12]:
# Group passengers by Sex and Pclass, count the fares and calculate the total
# paid for each group, then derive the mean fare paid for each group
fare = dataframe.groupby(['Sex', 'Pclass']).agg({'Fare': ['count', 'sum']})
fare['fare_avg'] = fare['Fare']['sum'] / fare['Fare']['count']
fare

| Sex | Pclass | Fare (count) | Fare (sum) | fare_avg |
| --- | --- | --- | --- | --- |
| female | 1 | 94 | 9975.825 | 106.125798 |
| female | 2 | 76 | 1669.7292 | 21.970121 |
| female | 3 | 144 | 2321.1086 | 16.11881 |
| male | 1 | 122 | 8201.5875 | 67.226127 |
| male | 2 | 108 | 2132.1125 | 19.741782 |
| male | 3 | 347 | 4393.5865 | 12.661633 |
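
The count, sum, and average can also be computed in one step with pandas *named aggregation*, which produces flat column names instead of a two-level header. The sketch below uses a small hypothetical dataframe (`tickets`) standing in for the Titanic data; the output column names `count`, `total`, and `fare_avg` are chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature dataset standing in for the Titanic data
tickets = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Pclass': [1, 1, 1, 3, 3],
    'Fare': [80.0, 100.0, 60.0, 8.0, 12.0],
})

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, aggregation) pair
fare_summary = tickets.groupby(['Sex', 'Pclass']).agg(
    count=('Fare', 'count'),
    total=('Fare', 'sum'),
    fare_avg=('Fare', 'mean'),
)
print(fare_summary)
```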


In [13]:
import pandas as pd

# Use pd.DataFrame() function to create a dataframe from a dictionary
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=data)
df

Unnamed: 0,col1,col2
0,1,3
1,2,4


In [14]:
# Use pd.DataFrame() function to create a dataframe from a numpy array
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
df2

Unnamed: 0,a,b,c
x,1,2,3
y,4,5,6
z,7,8,9


In [17]:
# Use pd.read_csv() function to create a dataframe from a .csv file
# from a URL or filepath
df3 = pd.read_csv('https://raw.githubusercontent.com/adacert/titanic/main/train.csv')
df3.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [19]:
# Print class of first row
print(type(df3.iloc[0]))

# Print class of "Name" column
print(type(df3['Name']))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [20]:
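# Create a copy of df3 named 'titanic'. The copy() method creates an
# independent dataframe; plain assignment (titanic = df3) would only
# create a second name for the same object.
titanic = df3.copy()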

# The head() method outputs the first 5 rows of dataframe
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
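
Plain assignment and the `copy()` method behave differently, which matters whenever a dataframe is meant to be copied. A minimal sketch with a hypothetical two-row dataframe:

```python
import pandas as pd

original = pd.DataFrame({'Age': [22.0, 38.0]})

alias = original               # a second name for the same object
independent = original.copy()  # a separate dataframe with the same data

# Adding a column through the alias also changes `original`
alias['Age_plus_100'] = alias['Age'] + 100

print(alias is original)
print(list(original.columns))
print(list(independent.columns))
```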


In [21]:
# The columns attribute returns an Index object containing the dataframe's columns
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [22]:
# The shape attribute returns the shape of the dataframe (rows, columns)
titanic.shape

(891, 12)

In [23]:
# The info() method returns summary information about the dataframe
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [24]:
# You can select a column by name using brackets
titanic['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [25]:
# You can select a column by name using dot notation
# only when its name contains no spaces or special characters
titanic.Age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [26]:
# You can create a DataFrame object of specific columns using a list
# of column names inside brackets
titanic[['Name', 'Age']]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
...,...,...
886,"Montvila, Rev. Juozas",27.0
887,"Graham, Miss. Margaret Edith",19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",26.0


In [27]:
# Use iloc to return a DataFrame view of the data in row 0
titanic.iloc[[0]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


In [28]:
# Use iloc to return a DataFrame view of the data in rows 0, 1, 2
titanic.iloc[0:3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [29]:
# Use iloc to return a DataFrame view of rows 0-2 at columns 3 and 4
titanic.iloc[0:3, [3, 4]]

Unnamed: 0,Name,Sex
0,"Braund, Mr. Owen Harris",male
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,"Heikkinen, Miss. Laina",female


In [30]:
# Use iloc to return a DataFrame view of all rows at column 3
titanic.iloc[:, [3]]

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


In [31]:
# Use iloc to access value in row 0, column 3
titanic.iloc[0, 3]

'Braund, Mr. Owen Harris'

In [32]:
# Use loc to access values in rows 0-3 at just the Name column
titanic.loc[0:3, ['Name']]

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
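
Note that `loc[0:3]` returned four rows, while `iloc[0:3]` returns three. `iloc` slices by position and excludes the stop position, whereas `loc` slices by label and includes the stop label. The sketch below makes the difference visible with a hypothetical dataframe that uses string row labels.

```python
import pandas as pd

# Hypothetical dataframe with string row labels
df = pd.DataFrame(
    {'Name': ['Ana', 'Gabi', 'Luz', 'Marta'], 'Age': [22, 22, 21, 20]},
    index=['a', 'b', 'c', 'd'],
)

# iloc slices by position and excludes the stop position
by_position = df.iloc[0:2]

# loc slices by label and includes the stop label
by_label = df.loc['a':'c', ['Name']]

print(len(by_position))
print(len(by_label))
```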


In [33]:
# Create a new column in the dataframe containing the value in the Age column + 100
titanic['Age_plus_100'] = titanic['Age'] + 100
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_plus_100
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,122.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,138.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,126.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,135.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,135.0


### Boolean Masking

*Boolean masking*, also called *Boolean indexing*, is a technique in NumPy and pandas for filtering the values in NumPy arrays or pandas dataframes. A mask is a sequence of `True`/`False` values, and only the elements where the mask is `True` are kept.

In [36]:
# Instantiate a dictionary of planetary data
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
       'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
       'moons': [0, 0, 1, 2, 80, 83, 27, 14]
        }

# Use pd.DataFrame() function to convert dictionary to dataframe
planets = pd.DataFrame(data)
planets

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
4,Jupiter,69911,80
5,Saturn,58232,83
6,Uranus,25362,27
7,Neptune,24622,14


In [42]:
# Create a Boolean mask of planets with fewer than 20 moons
mask = planets['moons'] < 20

print(f'Boolean masks are of type {type(mask)}')
mask

Boolean masks are of type <class 'pandas.core.series.Series'>


0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: moons, dtype: bool

In [38]:
# Apply the Boolean mask to the dataframe to filter it so it contains
# only the planets with fewer than 20 moons
planets[mask]

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
7,Neptune,24622,14


In [43]:
# Define the Boolean mask and apply it in a single line
planets[planets['moons'] < 20]

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
7,Neptune,24622,14


In [44]:
# Filtering with a Boolean mask doesn't change the original dataframe.
# It returns a new, filtered dataframe.
planets

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
4,Jupiter,69911,80
5,Saturn,58232,83
6,Uranus,25362,27
7,Neptune,24622,14


In [45]:
# You can assign the filtered dataframe to a named variable
moons_under_20 = planets[mask]
moons_under_20

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
7,Neptune,24622,14


In [46]:
# Create a Boolean mask of planets with fewer than 10 moons OR more than 50 moons
mask = (planets['moons'] < 10) | (planets['moons'] > 50)
mask

0     True
1     True
2     True
3     True
4     True
5     True
6    False
7    False
Name: moons, dtype: bool

In [47]:
# Apply the Boolean mask to filter the data
planets[mask]

Unnamed: 0,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
4,Jupiter,69911,80
5,Saturn,58232,83


In [48]:
# Create a Boolean mask of planets with more than 20 moons, excluding them if they
# have 80 moons or if their radius is less than 50,000 km.
mask = (planets['moons'] > 20) & ~(planets['moons'] == 80) & ~(planets['radius_km'] < 50000)

# Apply the mask
planets[mask]

Unnamed: 0,planet,radius_km,moons
5,Saturn,58232,83
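
Boolean masking works the same way on plain NumPy arrays. A short sketch, reusing the moon counts from the planets data as an array:

```python
import numpy as np

moons = np.array([0, 0, 1, 2, 80, 83, 27, 14])

# Comparing an array to a value produces an array of Booleans
mask = moons < 20
print(mask)

# Indexing with the mask keeps only the elements where the mask is True
under_20 = moons[mask]
print(under_20)

# Masks combine with & (and), | (or), and ~ (not), as in pandas
over_20_not_80 = moons[(moons > 20) & ~(moons == 80)]
print(over_20_not_80)
```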


## Grouping and Aggregation

In [50]:
import numpy as np
import pandas as pd

# Instantiate a dictionary of planetary data
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
                   'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [2440, 6052, 6371, 3390, 69911, 58232,
                     25362, 24622],
        'moons': [0, 0, 1, 2, 80, 83, 27, 14],
        'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
                 'gas giant', 'gas giant', 'ice giant', 'ice giant'],
        'rings': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes','yes'],
        'mean_temp_c': [167, 464, 15, -65, -110, -140, -195, -200],
        'magnetic_field': ['yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes']
        }

# Use pd.DataFrame() function to convert dictionary to dataframe
planets = pd.DataFrame(data)
planets

Unnamed: 0,planet,radius_km,moons,type,rings,mean_temp_c,magnetic_field
0,Mercury,2440,0,terrestrial,no,167,yes
1,Venus,6052,0,terrestrial,no,464,no
2,Earth,6371,1,terrestrial,no,15,yes
3,Mars,3390,2,terrestrial,no,-65,no
4,Jupiter,69911,80,gas giant,yes,-110,yes
5,Saturn,58232,83,gas giant,yes,-140,yes
6,Uranus,25362,27,ice giant,yes,-195,yes
7,Neptune,24622,14,ice giant,yes,-200,yes


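Calling `groupby()` on a dataframe returns a `DataFrameGroupBy` object, which records how the rows are split into groups. No values are computed until an aggregation such as `sum()` or `mean()` is applied to it.

In [51]:
# The groupby() method returns a groupby object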
planets.groupby(['type'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x116dbf970>
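
The groupby object is opaque when displayed, but it can be inspected. The sketch below rebuilds a reduced version of the planets data and looks inside the groups using iteration, `size()`, and `get_group()`.

```python
import pandas as pd

planets = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
               'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
    'type': ['terrestrial', 'terrestrial', 'terrestrial', 'terrestrial',
             'gas giant', 'gas giant', 'ice giant', 'ice giant'],
})

grouped = planets.groupby('type')

# Iterating over a groupby object yields (group name, sub-dataframe) pairs
for group_name, group_df in grouped:
    print(group_name, list(group_df['planet']))

# size() counts the rows in each group
group_sizes = grouped.size()
print(group_sizes)

# get_group() returns the rows belonging to a single group
ice_giants = grouped.get_group('ice giant')
```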

In [53]:
# Apply the sum() method to the groupby object to get the sum of the values
# in each column for each group. String columns are concatenated; pass
# numeric_only=True to sum only the numerical columns.
planets.groupby(['type']).sum()

| type | planet | radius_km | moons | rings | mean_temp_c | magnetic_field |
| --- | --- | --- | --- | --- | --- | --- |
| gas giant | JupiterSaturn | 128143 | 163 | yesyes | -250 | yesyes |
| ice giant | UranusNeptune | 49984 | 41 | yesyes | -395 | yesyes |
| terrestrial | MercuryVenusEarthMars | 18253 | 3 | nononono | 581 | yesnoyesno |


In [67]:
# Apply the sum() method to the groupby object and select only the
# 'moons' column. Double brackets return a DataFrame, not a Series.
type(planets.groupby(['type']).sum()[['moons']])

pandas.core.frame.DataFrame

In [75]:
# Create a df from the numeric fields and the fields we want to aggregate by
dfn = planets[['moons', 'radius_km', 'mean_temp_c', 'type', 'magnetic_field']]

# Group by type and magnetic_field and get the mean of the values
dfn.groupby(['type', 'magnetic_field']).mean()[['moons', 'radius_km', 'mean_temp_c']]
# planets.groupby(['type', 'magnetic_field']).mean()['moons']

| type | magnetic_field | moons | radius_km | mean_temp_c |
| --- | --- | --- | --- | --- |
| gas giant | yes | 81.5 | 64071.5 | -125.0 |
| ice giant | yes | 20.5 | 24992.0 | -197.5 |
| terrestrial | no | 1.0 | 4721.0 | 199.5 |
| terrestrial | yes | 0.5 | 4405.5 | 91.0 |


In [78]:
# Group by type, then use the agg() function to get the mean and median
# of the values in the numeric columns for each group
dfn = planets[['moons', 'radius_km', 'mean_temp_c', 'type']]

dfn.groupby(['type']).agg(['mean', 'median'])

,moons,moons,radius_km,radius_km,mean_temp_c,mean_temp_c
type,mean,median,mean,median,mean,median
gas giant,81.5,81.5,64071.5,64071.5,-125.0,-125.0
ice giant,20.5,20.5,24992.0,24992.0,-197.5,-197.5
terrestrial,0.75,0.5,4563.25,4721.0,145.25,91.0
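
Passing a list of functions to `agg()` produces two-level column headers, as in the output above. If flat column names are preferred, pandas also supports named aggregation, where each keyword argument maps an output column name to a `(column, function)` pair. This is a sketch: the output names `moons_mean`, `moons_median`, and `radius_km_max` are arbitrary choices, not names from the original notebook.

```python
import pandas as pd

# Rebuild the numeric columns and grouping column from the values shown above
planets = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
               'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
    'type': ['terrestrial'] * 4 + ['gas giant'] * 2 + ['ice giant'] * 2,
})

# Named aggregation: keyword = (source column, aggregation function)
summary = planets.groupby('type').agg(
    moons_mean=('moons', 'mean'),
    moons_median=('moons', 'median'),
    radius_km_max=('radius_km', 'max'),
)
print(summary)
```

Each output column is computed from exactly one source column, so different functions can be applied to different columns in a single call.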


In [79]:
# Group by type and magnetic_field, then use the agg() function to get the
# mean and max of the values in the numeric columns for each group
dfn = planets[['moons', 'radius_km', 'mean_temp_c', 'type', 'magnetic_field']]
dfn.groupby(['type', 'magnetic_field']).agg(['mean', 'max'])

,,moons,moons,radius_km,radius_km,mean_temp_c,mean_temp_c
type,magnetic_field,mean,max,mean,max,mean,max
gas giant,yes,81.5,83,64071.5,69911,-125.0,-110
ice giant,yes,20.5,27,24992.0,25362,-197.5,-195
terrestrial,no,1.0,2,4721.0,6052,199.5,464
terrestrial,yes,0.5,1,4405.5,6371,91.0,167


In [80]:
# Define a function that returns the 90th percentile of a Series
def percentile_90(x):
    return x.quantile(0.9)

In [81]:
# Group by type and magnetic_field, then use the agg() function to apply the
# mean and the custom-defined `percentile_90()` function to the numeric
# columns for each group
dfn = planets[['moons', 'radius_km', 'mean_temp_c', 'type', 'magnetic_field']]
dfn.groupby(['type', 'magnetic_field']).agg(['mean', percentile_90])

,,moons,moons,radius_km,radius_km,mean_temp_c,mean_temp_c
type,magnetic_field,mean,percentile_90,mean,percentile_90,mean,percentile_90
gas giant,yes,81.5,82.7,64071.5,68743.1,-125.0,-113.0
ice giant,yes,20.5,25.7,24992.0,25288.0,-197.5,-195.5
terrestrial,no,1.0,1.8,4721.0,5785.8,199.5,411.1
terrestrial,yes,0.5,0.9,4405.5,5977.9,91.0,151.8
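
The `percentile_90` values above are not values that occur in the data, because `quantile()` interpolates linearly between neighboring values by default. The sketch below checks this by hand for the gas giant `moons` column (80 and 83); the variable names are illustrative.

```python
import pandas as pd

# Moon counts of the two gas giants (Jupiter and Saturn) from the data above
gas_giant_moons = pd.Series([80, 83])

# With n sorted values, the 0.9 quantile sits at position 0.9 * (n - 1).
# For n = 2 that is position 0.9, i.e. 90% of the way from 80 to 83.
by_hand = 80 + 0.9 * (83 - 80)

# pandas applies the same linear interpolation by default
by_pandas = gas_giant_moons.quantile(0.9)

print(by_hand, by_pandas)
```

Both results are approximately 82.7, matching the `gas giant` row in the table above.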


## Merging and Joining

When two dataframes have the same columns, the `pd.concat()` function can stack them into a single dataframe.

When two dataframes have different columns but share a key column, the `pd.merge()` function can create a new dataframe that combines the columns of both.

In [84]:
import numpy as np
import pandas as pd

# Instantiate a dictionary of planetary data
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
        'radius_km': [2440, 6052, 6371, 3390],
        'moons': [0, 0, 1, 2],
        }
# Use pd.DataFrame() function to convert dictionary to dataframe
df1 = pd.DataFrame(data)
df1

,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2


In [85]:
# Instantiate a dictionary of planetary data
data = {'planet': ['Jupiter', 'Saturn', 'Uranus', 'Neptune'],
        'radius_km': [69911, 58232, 25362, 24622],
        'moons': [80, 83, 27, 14],
        }
# Use pd.DataFrame() function to convert dictionary to dataframe
df2 = pd.DataFrame(data)
df2

,planet,radius_km,moons
0,Jupiter,69911,80
1,Saturn,58232,83
2,Uranus,25362,27
3,Neptune,24622,14


The `pd.concat()` function can combine the two dataframes along axis 0, with the second dataframe being added as new rows to the first dataframe.

In [86]:
# Stack df2 below df1 (axis=0 concatenates along the rows)
df3 = pd.concat([df1, df2], axis=0)
df3

,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
0,Jupiter,69911,80
1,Saturn,58232,83
2,Uranus,25362,27
3,Neptune,24622,14


In [87]:
# Reset the row indices
df3 = df3.reset_index(drop=True)
df3

,planet,radius_km,moons
0,Mercury,2440,0
1,Venus,6052,0
2,Earth,6371,1
3,Mars,3390,2
4,Jupiter,69911,80
5,Saturn,58232,83
6,Uranus,25362,27
7,Neptune,24622,14
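
The concatenate-then-reset pattern above can also be done in one step. As a sketch, assuming the original row labels do not need to be kept, `pd.concat()` accepts `ignore_index=True`, which discards the source indices and numbers the rows from 0.

```python
import pandas as pd

# The same two dataframes as above
df1 = pd.DataFrame({'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
                    'radius_km': [2440, 6052, 6371, 3390],
                    'moons': [0, 0, 1, 2]})
df2 = pd.DataFrame({'planet': ['Jupiter', 'Saturn', 'Uranus', 'Neptune'],
                    'radius_km': [69911, 58232, 25362, 24622],
                    'moons': [80, 83, 27, 14]})

# ignore_index=True renumbers the rows 0..7 during concatenation,
# so no separate reset_index() call is needed
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
```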


In [88]:
# NOTE: THIS CELL WAS NOT SHOWN IN THE INSTRUCTIONAL VIDEO, BUT WAS RUN AS A
#       SETUP CELL
data = {'planet': ['Earth', 'Mars','Jupiter', 'Saturn', 'Uranus',
                   'Neptune', 'Janssen', 'Tadmor'],
        'type': ['terrestrial', 'terrestrial','gas giant', 'gas giant',
                 'ice giant', 'ice giant', 'super earth','gas giant'],
        'rings': ['no', 'no', 'yes', 'yes', 'yes','yes', 'no', None],
        'mean_temp_c': [15, -65, -110, -140, -195, -200, None, None],
        'magnetic_field': ['yes', 'no', 'yes', 'yes', 'yes', 'yes', None, None],
        'life': [1, 0, 0, 0, 0, 0, 1, 1]
        }
df4 = pd.DataFrame(data)

In [89]:
df4

,planet,type,rings,mean_temp_c,magnetic_field,life
0,Earth,terrestrial,no,15.0,yes,1
1,Mars,terrestrial,no,-65.0,no,0
2,Jupiter,gas giant,yes,-110.0,yes,0
3,Saturn,gas giant,yes,-140.0,yes,0
4,Uranus,ice giant,yes,-195.0,yes,0
5,Neptune,ice giant,yes,-200.0,yes,0
6,Janssen,super earth,no,,,1
7,Tadmor,gas giant,,,,1


In [90]:
# Use pd.merge() to combine dataframes.
# Inner merge retains only keys that appear in both dataframes.
inner = pd.merge(df3, df4, on='planet', how='inner')
inner

,planet,radius_km,moons,type,rings,mean_temp_c,magnetic_field,life
0,Earth,6371,1,terrestrial,no,15.0,yes,1
1,Mars,3390,2,terrestrial,no,-65.0,no,0
2,Jupiter,69911,80,gas giant,yes,-110.0,yes,0
3,Saturn,58232,83,gas giant,yes,-140.0,yes,0
4,Uranus,25362,27,ice giant,yes,-195.0,yes,0
5,Neptune,24622,14,ice giant,yes,-200.0,yes,0


In [91]:
# Use pd.merge() to combine dataframes.
# Outer merge retains all keys from both dataframes.
outer = pd.merge(df3, df4, on='planet', how='outer')
outer

,planet,radius_km,moons,type,rings,mean_temp_c,magnetic_field,life
0,Mercury,2440.0,0.0,,,,,
1,Venus,6052.0,0.0,,,,,
2,Earth,6371.0,1.0,terrestrial,no,15.0,yes,1.0
3,Mars,3390.0,2.0,terrestrial,no,-65.0,no,0.0
4,Jupiter,69911.0,80.0,gas giant,yes,-110.0,yes,0.0
5,Saturn,58232.0,83.0,gas giant,yes,-140.0,yes,0.0
6,Uranus,25362.0,27.0,ice giant,yes,-195.0,yes,0.0
7,Neptune,24622.0,14.0,ice giant,yes,-200.0,yes,0.0
8,Janssen,,,super earth,no,,,1.0
9,Tadmor,,,gas giant,,,,1.0
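
Two details of the outer merge are easy to miss: it is not obvious which source each row came from, and integer columns such as `moons` become floats because missing values are stored as `NaN`. The sketch below uses the `indicator=True` option of `pd.merge()`, which adds a `_merge` column, to make the first point visible. For brevity it builds `df3` directly and keeps only some columns of `df4`; the name `counts` is illustrative.

```python
import pandas as pd

# df3 as produced by the concatenation above
df3 = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
               'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
})

# A subset of the df4 columns shown above
df4 = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus',
               'Neptune', 'Janssen', 'Tadmor'],
    'type': ['terrestrial', 'terrestrial', 'gas giant', 'gas giant',
             'ice giant', 'ice giant', 'super earth', 'gas giant'],
    'life': [1, 0, 0, 0, 0, 0, 1, 1],
})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'
outer = pd.merge(df3, df4, on='planet', how='outer', indicator=True)
counts = outer['_merge'].value_counts()
print(counts)

# Integer columns with missing values after the merge are stored as floats
print(df3['moons'].dtype, outer['moons'].dtype)
```

Mercury and Venus are `left_only`, Janssen and Tadmor are `right_only`, and the remaining six planets are `both`.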


In [92]:
# Use pd.merge() to combine dataframes.
# Left merge retains only keys that appear in the left dataframe.
left = pd.merge(df3, df4, on='planet', how='left')
left

,planet,radius_km,moons,type,rings,mean_temp_c,magnetic_field,life
0,Mercury,2440,0,,,,,
1,Venus,6052,0,,,,,
2,Earth,6371,1,terrestrial,no,15.0,yes,1.0
3,Mars,3390,2,terrestrial,no,-65.0,no,0.0
4,Jupiter,69911,80,gas giant,yes,-110.0,yes,0.0
5,Saturn,58232,83,gas giant,yes,-140.0,yes,0.0
6,Uranus,25362,27,ice giant,yes,-195.0,yes,0.0
7,Neptune,24622,14,ice giant,yes,-200.0,yes,0.0


In [93]:
# Use pd.merge() to combine dataframes.
# Right merge retains only keys that appear in the right dataframe.
right = pd.merge(df3, df4, on='planet', how='right')
right

,planet,radius_km,moons,type,rings,mean_temp_c,magnetic_field,life
0,Earth,6371.0,1.0,terrestrial,no,15.0,yes,1
1,Mars,3390.0,2.0,terrestrial,no,-65.0,no,0
2,Jupiter,69911.0,80.0,gas giant,yes,-110.0,yes,0
3,Saturn,58232.0,83.0,gas giant,yes,-140.0,yes,0
4,Uranus,25362.0,27.0,ice giant,yes,-195.0,yes,0
5,Neptune,24622.0,14.0,ice giant,yes,-200.0,yes,0
6,Janssen,,,super earth,no,,,1
7,Tadmor,,,gas giant,,,,1
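
A right merge is a left merge with the two dataframes swapped; only the column order differs. The sketch below illustrates this with `df3` built directly and a subset of the `df4` columns; the names `swapped`, `same_rows`, and `same_columns` are illustrative.

```python
import pandas as pd

# df3 as produced by the concatenation above
df3 = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
               'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
})

# A subset of the df4 columns shown above
df4 = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus',
               'Neptune', 'Janssen', 'Tadmor'],
    'type': ['terrestrial', 'terrestrial', 'gas giant', 'gas giant',
             'ice giant', 'ice giant', 'super earth', 'gas giant'],
})

# Right merge: keep every key from df4
right = pd.merge(df3, df4, on='planet', how='right')

# Left merge with the arguments swapped: also keeps every key from df4
swapped = pd.merge(df4, df3, on='planet', how='left')

# Same planets in the same order, and the same set of columns
same_rows = list(right['planet']) == list(swapped['planet'])
same_columns = sorted(right.columns) == sorted(swapped.columns)
print(same_rows, same_columns)
```

In practice many analysts use left merges throughout and control the result by choosing which dataframe goes first.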
