# Intermediate Python

These are my notes for DataCamp's course [_Intermediate Python_](https://www.datacamp.com/courses/intermediate-python).

This course is presented by Hugo Bowne-Anderson, formerly Data Scientist at DataCamp. Collaborators are Filip Schouwenaars and Vincent Vankrunkelsven.

Prerequisite:

- [_Introduction to Python_](../Introduction%20to%20Python/Introduction%20to%20Python.ipynb)

This course is part of these tracks:

- Data Analyst with Python
- Data Engineer
- Data Scientist with Python
- Data Scientist Professional with Python
- Python Fundamentals
- Python Programmer

## Datasets

| Name | File |
| :--- | :--- |
| Gapminder | gapminder.csv |
| Cars | cars.csv |
| BRICS | brics.csv |

## Versions

For this notebook, I used:

- Python 3.12.7
- matplotlib 3.9.2
- numpy 2.1.3
- pandas 2.2.3

## Imports

Imports for the entire notebook are placed here for convenience and clarity.

Enable copy-on-write mode for pandas; this will be the default in pandas 3.0.

In [None]:
import csv

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
pd.set_option("mode.copy_on_write", True)

## Matplotlib

Data visualization, an important early step in data analysis, is used to explore the data to look for obvious insights and correlations.

See https://matplotlib.org/stable/tutorials/introductory/customizing.html for how to customize Matplotlib. Here, I make the figures larger than the default size.

In [None]:
# Find the location of the matplotlibrc file.
print(matplotlib.matplotlib_fname())

In [None]:
# Print the current figure size before increasing it.
# The default size is (6.4, 4.8).
print(plt.rcParams["figure.figsize"])

In [None]:
# Increase the figure size.
plt.rc("figure", figsize=(12.0, 9.0))
# plt.rcParams["figure.figsize"] = (12.0, 9.0)
print(plt.rcParams["figure.figsize"])

### Basic Plots

In [None]:
# Create a linear plot.
year_example = [1950, 1970, 1990, 2010]
pop_example = [2.519, 3.692, 5.263, 6.972]
plt.plot(year_example, pop_example)
plt.show()

In [None]:
# Create a scatter plot.
plt.scatter(year_example, pop_example)
plt.show()

#### Exercises

In [None]:
# I copied these lists from the IPython Shell window.
year = [1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027, 2028, 2029, 2030, 2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038, 2039, 2040, 2041, 2042, 2043, 2044, 2045, 2046, 2047, 2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, 2058, 2059, 2060, 2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 2071, 2072, 2073, 2074, 2075, 2076, 2077, 2078, 2079, 2080, 2081, 2082, 2083, 2084, 2085, 2086, 2087, 2088, 2089, 2090, 2091, 2092, 2093, 2094, 2095, 2096, 2097, 2098, 2099, 2100]
pop_for_year = [2.53, 2.57, 2.62, 2.67, 2.71, 2.76, 2.81, 2.86, 2.92, 2.97, 3.03, 3.08, 3.14, 3.2, 3.26, 3.33, 3.4, 3.47, 3.54, 3.62, 3.69, 3.77, 3.84, 3.92, 4.0, 4.07, 4.15, 4.22, 4.3, 4.37, 4.45, 4.53, 4.61, 4.69, 4.78, 4.86, 4.95, 5.05, 5.14, 5.23, 5.32, 5.41, 5.49, 5.58, 5.66, 5.74, 5.82, 5.9, 5.98, 6.05, 6.13, 6.2, 6.28, 6.36, 6.44, 6.51, 6.59, 6.67, 6.75, 6.83, 6.92, 7.0, 7.08, 7.16, 7.24, 7.32, 7.4, 7.48, 7.56, 7.64, 7.72, 7.79, 7.87, 7.94, 8.01, 8.08, 8.15, 8.22, 8.29, 8.36, 8.42, 8.49, 8.56, 8.62, 8.68, 8.74, 8.8, 8.86, 8.92, 8.98, 9.04, 9.09, 9.15, 9.2, 9.26, 9.31, 9.36, 9.41, 9.46, 9.5, 9.55, 9.6, 9.64, 9.68, 9.73, 9.77, 9.81, 9.85, 9.88, 9.92, 9.96, 9.99, 10.03, 10.06, 10.09, 10.13, 10.16, 10.19, 10.22, 10.25, 10.28, 10.31, 10.33, 10.36, 10.38, 10.41, 10.43, 10.46, 10.48, 10.5, 10.52, 10.55, 10.57, 10.59, 10.61, 10.63, 10.65, 10.66, 10.68, 10.7, 10.72, 10.73, 10.75, 10.77, 10.78, 10.79, 10.81, 10.82, 10.83, 10.84, 10.85]
plt.plot(year, pop_for_year)
plt.show()

In [None]:
# The data were collected by Hans Rosling to build his bubble chart.
#   life_exp: life expectancy for each country, in years
#   gdp_cap: gross domestic product per capita, 2007, in US dollars
#   pop: in millions of persons
#   cont: continent
# All of the data are contained in the gapminder.csv data file.
# The fields are:
#   ,country,year,population,cont,life_exp,gdp_cap

# Import the data into lists.
# See the Introduction to Python.ipynb file for an example of how to do this.
pop_for_country = []
cont = []
life_exp = []
gdp_cap = []
with open("gapminder.csv", newline="") as csv_file:
    csvreader = csv.reader(csv_file)
    header = next(csvreader)
    for row in csvreader:
        pop_for_country.append(float(row[3]))
        cont.append(str(row[4]))
        life_exp.append(float(row[5]))
        gdp_cap.append(float(row[6]))

# Convert population to millions using a NumPy array.
np_pop = np.array(pop_for_country)
np_pop = np_pop / 1000000

In [None]:
# Scatter plot of life expectancy (dependent variable)
# vs. GDP per capita (independent variable).
plt.scatter(gdp_cap, life_exp)
plt.xscale("log")
plt.show()

In [None]:
# Scatter plot of life expectancy (dependent variable)
# vs. population (independent variable).
plt.scatter(np_pop, life_exp)
plt.xscale("log")
plt.show()

### Histograms

Histograms are useful during data exploration to visualize the distribution of the data. The most important arguments are `x`, the data, and `bins`, the number of bins (10 by default).

In [None]:
# Example:
values = [0.0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6.0]
plt.hist(values, bins=3)
plt.show()
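
To see the numbers behind the bars, `np.histogram` returns the bin counts and bin edges without drawing anything. This is a minimal sketch using the same `values` list; the names `counts` and `edges` are chosen here for illustration only.

```python
import numpy as np

values = [0.0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6.0]

# Three equal-width bins between min(values) and max(values).
counts, edges = np.histogram(values, bins=3)
print(counts)  # number of values that fall in each bin
print(edges)   # the four edges that delimit the three bins
```

The counts should match the bar heights drawn by `plt.hist(values, bins=3)`.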

#### Exercises

In [None]:
plt.hist(life_exp) # bins=10 by default
plt.show()

In [None]:
# Try 5 bins and 20 bins.
plt.hist(life_exp, bins=5)
plt.show()

In [None]:
plt.hist(life_exp, bins=20)
plt.show()

In [None]:
plt.hist(life_exp, bins=15)
plt.show()
life_exp1950 = [28.8, 55.23, 43.08, 30.02, 62.48, 69.12, 66.8, 50.94, 37.48, 68.0, 38.22, 40.41, 53.82, 47.62, 50.92, 59.6, 31.98, 39.03, 39.42, 38.52, 68.75, 35.46, 38.09, 54.74, 44.0, 50.64, 40.72, 39.14, 42.11, 57.21, 40.48, 61.21, 59.42, 66.87, 70.78, 34.81, 45.93, 48.36, 41.89, 45.26, 34.48, 35.93, 34.08, 66.55, 67.41, 37.0, 30.0, 67.5, 43.15, 65.86, 42.02, 33.61, 32.5, 37.58, 41.91, 60.96, 64.03, 72.49, 37.37, 37.47, 44.87, 45.32, 66.91, 65.39, 65.94, 58.53, 63.03, 43.16, 42.27, 50.06, 47.45, 55.56, 55.93, 42.14, 38.48, 42.72, 36.68, 36.26, 48.46, 33.68, 40.54, 50.99, 50.79, 42.24, 59.16, 42.87, 31.29, 36.32, 41.72, 36.16, 72.13, 69.39, 42.31, 37.44, 36.32, 72.67, 37.58, 43.44, 55.19, 62.65, 43.9, 47.75, 61.31, 59.82, 64.28, 52.72, 61.05, 40.0, 46.47, 39.88, 37.28, 58.0, 30.33, 60.4, 64.36, 65.57, 32.98, 45.01, 64.94, 57.59, 38.64, 41.41, 71.86, 69.62, 45.88, 58.5, 41.22, 50.85, 38.6, 59.1, 44.6, 43.58, 39.98, 69.18, 68.44, 66.07, 55.09, 40.41, 43.16, 32.55, 42.04, 48.45]
plt.hist(life_exp1950, bins=15)
plt.show()

In [None]:
plt.hist(life_exp1950, bins=15, alpha=0.3)
plt.hist(life_exp, bins=15, alpha=0.3)
plt.show()

### Customizing Plots

In [None]:
# Add more data for earlier years.
year_ext = [1800, 1850, 1900] + year
pop_ext = [1.0, 1.262, 1.650] + pop_for_year
# Create a line plot.
plt.plot(year_ext, pop_ext)
# Add axis labels.
plt.xlabel("Year")
plt.ylabel("Population (billions)")
# Add a title.
plt.title("World Population Projections")
# Start the y axis at 0 and set the tick positions.
plt.yticks((0, 2, 4, 6, 8, 10, 12))
# Add a label for each y tick. (This is crude, so I modified the y label instead.)
# plt.yticks((0, 2, 4, 6, 8, 10, 12), ("0B", "2B", "4B", "6B", "8B", "10B", "12B"))
plt.show()

#### Exercises

This work continues to enhance the world development plot, with GDP per capita on the x-axis (logarithmic scale) and life expectancy on the y-axis.

In [None]:
# Provide a title and axis labels.
plt.scatter(gdp_cap, life_exp)
plt.xscale("log")
xlab = "GDP per Capita (USD)"
ylab = "Life Expectancy (years)"
title = "World Development in 2007"
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)
plt.show()

In [None]:
# Label x ticks.
plt.scatter(gdp_cap, life_exp)
plt.title("World Development in 2007")
plt.xscale("log")
plt.xlabel("GDP per Capita (USD)")
tick_val = (100, 1000, 10000, 100000)
tick_lab = ("0.1k", "1k", "10k", "100k")
plt.xticks(tick_val, tick_lab)
plt.ylabel("Life Expectancy (years)")
plt.show()

In [None]:
# Size scatter plot dots to reflect population size.
plt.scatter(gdp_cap, life_exp, s=np_pop)
plt.title("World Development in 2007")
plt.xscale("log")
plt.xlabel("GDP per Capita (USD)")
tick_val = (100, 1000, 10000, 100000)
tick_lab = ("0.1k", "1k", "10k", "100k")
plt.xticks(tick_val, tick_lab)
plt.ylabel("Life Expectancy (years)")
plt.show()

In [None]:
# Double the sizes of the scatter plot dots.
pop_dot_size = 2 * np_pop
plt.scatter(gdp_cap, life_exp, s=pop_dot_size)
plt.title("World Development in 2007")
plt.xscale("log")
plt.xlabel("GDP per Capita (USD)")
tick_val = (100, 1000, 10000, 100000)
tick_lab = ("0.1k", "1k", "10k", "100k")
plt.xticks(tick_val, tick_lab)
plt.ylabel("Life Expectancy (years)")
plt.show()

In [None]:
# Create a list of colors from the continents for each data row.
# Yellow is not a good color; I changed it to purple.
cont_colors = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'purple',
    'Oceania': 'black'
}
# Use a list comprehension to create a list of colors.
colors = [cont_colors[x] for x in cont]

In [None]:
# Color the scatter plot dots.
plt.scatter(gdp_cap, life_exp, s=pop_dot_size, c=colors, alpha=0.5)
plt.title("World Development in 2007")
plt.xscale("log")
plt.xlabel("GDP per Capita (USD)")
tick_val = (100, 1000, 10000, 100000)
tick_lab = ("0.1k", "1k", "10k", "100k")
plt.xticks(tick_val, tick_lab)
plt.ylabel("Life Expectancy (years)")
plt.show()

In [None]:
# Add more customizations.
# Provide labels for China and India, and add a grid.
plt.scatter(gdp_cap, life_exp, s=pop_dot_size, c=colors, alpha=0.5)
plt.title("World Development in 2007")
plt.grid(True)
plt.xscale("log")
plt.xlabel("GDP per Capita (USD)")
tick_val = (100, 1000, 10000, 100000)
tick_lab = ("0.1k", "1k", "10k", "100k")
plt.xticks(tick_val, tick_lab)
plt.ylabel("Life Expectancy (years)")
plt.text(2000, 68, "India")
plt.text(4000, 76.5, "China")
plt.show()

## Dictionaries

### Dictionaries, Part 1

Dictionary keys must be hashable, which in practice means immutable objects (strings and ints are the usual keys).
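
As a small illustration of the restriction on keys, the sketch below uses a string, an int, and a tuple as keys, then shows that a list is rejected. The dictionary `lookup` and its contents are made up for this example.

```python
# Strings, ints, and tuples of immutable items all work as keys.
lookup = {"spain": "madrid", 42: "answer", (52.52, 13.40): "berlin"}
print(lookup[(52.52, 13.40)])

# A list is mutable and cannot be hashed, so it cannot be a key.
error_message = None
try:
    lookup[["spain", "france"]] = "not allowed"
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```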

#### Exercises

In [None]:
# Indexing parallel lists to get to values for keys.
countries = ["spain", "france", "germany", "norway"]
capitals = ["madrid", "paris", "berlin", "oslo"]
ind_ger = countries.index("germany")
print(capitals[ind_ger])

In [None]:
# Create a dictionary to set the key-value pairs.
europe = {"spain": "madrid", "france": "paris", "germany": "berlin", "norway": "oslo"}
print(europe)

In [None]:
# Print keys and values, and get a value for a specific key.
print(europe.keys())
print(europe.values())
print(europe["norway"])
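
The parallel lists from the first exercise can also be turned into a dictionary directly. This is a sketch using the built-in `zip` and `dict`; the name `europe_from_lists` is hypothetical and not part of the course code.

```python
countries = ["spain", "france", "germany", "norway"]
capitals = ["madrid", "paris", "berlin", "oslo"]

# zip pairs each country with the capital at the same position,
# and dict turns those pairs into key-value pairs.
europe_from_lists = dict(zip(countries, capitals))
print(europe_from_lists)
print(europe_from_lists["germany"])
```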

### Dictionaries, Part 2

In [None]:
# Add a new key and value to an existing dictionary.
world = {"afghanistan": 30.55, "albania": 2.81, "algeria": 39.21}
print(world)
world["sealand"] = 0.000027
print(world)
# See if a key is present in a dictionary.
print("sealand" in world)
# See if a value is present in a dictionary.
print(30.55 in world.values())

In [None]:
# Update a key-value pair.
world["sealand"] = 0.000028
print(world)

In [None]:
# Delete a key-value pair.
del world["sealand"]
print(world)

#### Exercises

In [None]:
# Add key-value pairs.
europe["italy"] = "rome"
print("italy" in europe)
europe["poland"] = "warsaw"
print(europe)

In [None]:
# Fix errors in a dictionary.
europe2 = {"spain": "madrid", "france": "paris", "germany": "bonn", "norway": "oslo",
           "italy": "rome", "poland": "warsaw", "australia": "vienna"}
europe2["germany"] = "berlin"
del europe2["australia"]
print(europe2)

In [None]:
# Dictionaries of dictionaries.
europe3 = {"spain": {"capital": "madrid", "population": 46.77},
          "france": {"capital": "paris", "population": 66.03},
          "germany": {"capital": "berlin", "population": 80.62},
          "norway": {"capital": "oslo", "population": 5.084}}
print(europe3["france"]["capital"])
italy = {"capital": "rome", "population": 59.83}
europe3["italy"] = italy
print(europe3)

## Pandas

### Create DataFrames

Pandas is a high-level data manipulation tool that creates tables of data where the columns can have different data types. It is more flexible than using a NumPy 2D array. Pandas is built on top of the NumPy package. The Pandas data table is a DataFrame, in which rows and columns have unique labels.

#### DataFrame()

In [None]:
# Build a DataFrame from a dictionary.
brics_dict = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
              "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
              "area": [8.516, 17.10, 3.286, 9.597, 1.221],
              "population": [200.4, 143.5, 1252, 1357, 52.98]}
brics_df = pd.DataFrame(brics_dict)
brics_df.index = ["BR", "RU", "IN", "CH", "SA"]
print(brics_df)

In [None]:
# Something I dislike about Pandas is inconsistent naming.
print(brics_df.index) # row names
print(brics_df.columns) # column names
print(brics_df.dtypes) # data types

#### read_csv()

Use the `pd.read_csv` function to read data from a CSV file.

In [None]:
# Load the data from a CSV file.
brics_df2 = pd.read_csv("brics.csv", header=0, index_col=0)
print(brics_df2)

#### Exercises

In [None]:
# Build a DataFrame from lists and a dict.
# The dict contains the column labels and values of the DataFrame.
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
my_dict = {"cars_per_cap": cpc, "country": names, "drives_right": dr}
cars = pd.DataFrame(my_dict)
# Set the index values (row labels).
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
cars.index = row_labels
print(cars)

In [None]:
# Create a dataframe from my_dict and row_labels all at once.
cars2 = pd.DataFrame(my_dict, row_labels)
print(cars2)
# Test that cars and cars2 contain the same elements.
print(cars2.equals(cars))

In [None]:
# Import data from the cars.csv file.
cars3 = pd.read_csv("cars.csv", index_col=0)
print(cars3)
# Test that cars and cars3 contain the same elements.
print(cars3.equals(cars))

### Index and Select Data

#### Using []

Index data using square brackets, although this provides limited functionality.
The documentation recommends using .iat, .iloc, .at, or .loc.

In [None]:
# For convenience, print brics_df2.
print(brics_df2)

In [None]:
# Indexing a column with square brackets returns a Series object.
print(type(brics_df2))
country_series = brics_df2["country"]
print(country_series)
print(type(country_series))

In [None]:
# Use [[]] to extract a column as a DataFrame object.
# This is actually indexing the columns using a list of column names.
country_df = brics_df2[["country"]]
print(country_df)
print(type(country_df))

In [None]:
# Use [[]] to extract multiple columns as a DataFrame object.
country_capital_df = brics_df2[["country", "capital"]]
print(country_capital_df)
print(type(country_capital_df))

In [None]:
# Can a tuple be used for indexing? No: pandas treats a tuple as a single
# key rather than a list of column names, so this raises a KeyError.
# country_capital_df2 = brics_df2[("country", "capital")]
# print(country_capital_df2)
# print(type(country_capital_df2))

In [None]:
# Extract rows using slice indexing; this returns a DataFrame object.
brics_df3 = brics_df2[1:4]
print(brics_df3)
print(type(brics_df3))

In [None]:
# Try other slices. Using [0] or [0, 1] does not work.
print(brics_df2[0:1])
print()
print(brics_df2[0:2])

#### Using loc

Access DataFrame elements by label using `loc`. `loc` takes a row label, optionally followed by a column label; either one can also be a list of labels or `:`. `loc` returns a single value, a Series, or a DataFrame depending on the indexers.

In [None]:
# Return a Series using a row label.
ru_series = brics_df2.loc["RU"]
print(ru_series)
print(type(ru_series))

In [None]:
# Use a list containing the row label to return a DataFrame object.
ru_df = brics_df2.loc[["RU"]]
print(ru_df)
print(type(ru_df))

In [None]:
# Return multiple rows as a DataFrame.
brics_df4 = brics_df2.loc[["RU", "IN", "CH"]]
print(brics_df4)
print(type(brics_df4))

In [None]:
# Use lists of row and column labels to return a subset of rows and columns as a DataFrame object.
brics_df5 = brics_df2.loc[["RU", "IN", "CH"], ["country", "capital"]]
print(brics_df5)
print(type(brics_df5))

In [None]:
# Using row and column indexes returns a single item.
ru_country = brics_df2.loc["RU", "country"]
print(ru_country)
print(type(ru_country))

In [None]:
# Use lists containing a single row label and a single column label
# to return a DataFrame containing a single row and column.
ru_country_df = brics_df2.loc[["RU"], ["country"]]
print(ru_country_df)
print(type(ru_country_df))

In [None]:
# Returning one row or column with more than one item returns a Series object.
ru_in_countries_series = brics_df2.loc[["RU", "IN"], "country"]
print(ru_in_countries_series)
print(type(ru_in_countries_series))

In [None]:
# Index using a list of column names to get a DataFrame object.
ru_in_countries_df = brics_df2.loc[["RU", "IN"], ["country"]]
print(ru_in_countries_df)
print(type(ru_in_countries_df))

In [None]:
# Return a Series containing one row and multiple columns.
ru_country_capital_series = brics_df2.loc["RU", ["country", "capital"]]
print(ru_country_capital_series)
print(type(ru_country_capital_series))

In [None]:
# Use two lists to return a DataFrame containing one row and multiple columns.
ru_country_capital_df = brics_df2.loc[["RU"], ["country", "capital"]]
print(ru_country_capital_df)
print(type(ru_country_capital_df))

In [None]:
# Get all rows and a subset of columns.
brics_df6 = brics_df2.loc[:, ["country", "capital"]]
print(brics_df6)
print(type(brics_df6))

In [None]:
# Get a subset of rows and all columns using row labels.
brics_df7 = brics_df2.loc[["RU", "CH"]]
print(brics_df7)
print(type(brics_df7))

In [None]:
# Get a subset of rows and all columns using : to specify all columns.
brics_df8 = brics_df2.loc[["RU", "CH"], :]
print(brics_df8)
print(type(brics_df8))

#### Using iloc

Use `iloc` with numeric indexes to subset a DataFrame.

In [None]:
# Return a Series containing a single row.
brazil_series = brics_df2.iloc[0]
print(brazil_series)
print(type(brazil_series))

In [None]:
# Return a DataFrame containing a single row and all columns.
brazil_df = brics_df2.iloc[[0]]
print(brazil_df)
print(type(brazil_df))

In [None]:
# Return a DataFrame containing a single row and all columns.
# This is more explicit.
brazil_df2 = brics_df2.iloc[[0], :]
print(brazil_df2)
print(type(brazil_df2))

In [None]:
# Return a DataFrame containing multiple rows and all columns.
br_ru_df = brics_df2.iloc[[0, 1]]
print(br_ru_df)
print(type(br_ru_df))

In [None]:
# Return a DataFrame containing multiple rows and all columns.
# This is more explicit.
br_ru_df2 = brics_df2.iloc[[0, 1], :]
print(br_ru_df2)
print(type(br_ru_df2))

In [None]:
# Return a DataFrame containing multiple rows and all columns.
br_ru_df3 = brics_df2.iloc[0:2]
print(br_ru_df3)
print(type(br_ru_df3))
print(br_ru_df2.equals(br_ru_df3))

In [None]:
# Return a DataFrame containing multiple rows and all columns.
# This is more explicit.
br_ru_df4 = brics_df2.iloc[0:2, :]
print(br_ru_df4)
print(type(br_ru_df4))
print(br_ru_df2.equals(br_ru_df4))

In [None]:
# Return the value of an individual cell of the DataFrame.
br_country = brics_df2.iloc[0, 0]
print(br_country)
print(type(br_country))

In [None]:
# Return a DataFrame containing one row and column.
br_country_df = brics_df2.iloc[[0], [0]]
print(br_country_df)
print(type(br_country_df))

In [None]:
# Return a DataFrame with multiple rows and columns.
brics_df10 = brics_df2.iloc[[0, 2], [1, 3]]
print(brics_df10)
print(type(brics_df10))

In [None]:
# Return all rows and a subset of columns.
brics_df11 = brics_df2.iloc[:, [1, 3]]
print(brics_df11)
print(type(brics_df11))
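
The `.at` and `.iat` accessors mentioned earlier are not used in these notes. As a hedged sketch, they read a single cell by label and by position, respectively. The small `demo_df` below is a stand-in built from part of `brics_dict`, not the contents of brics.csv.

```python
import pandas as pd

# Small stand-in DataFrame for illustration.
demo_df = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India"],
     "capital": ["Brasilia", "Moscow", "New Delhi"]},
    index=["BR", "RU", "IN"],
)

# .at takes a row label and a column label.
capital_by_label = demo_df.at["RU", "capital"]
# .iat takes integer positions for the row and the column.
capital_by_position = demo_df.iat[1, 1]
print(capital_by_label)
print(capital_by_position)
```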

#### Exercises

In [None]:
# Create subsets of the cars2 DataFrame as a Series and a DataFrame.
print(cars2)
print()
print(cars2["country"])
print()
print(cars2[["country"]])
print()
# Print a DataFrame containing the "country" and "drives_right" columns.
print(cars2[["country", "drives_right"]])

In [None]:
# Row indexing with [] is limited to slices; use loc or iloc for anything else.
# Print the first three rows.
print(cars2[0:3])
print()
# Print rows 4-6.
print(cars2[3:6])

In [None]:
# Indexing using loc.
print(cars2.loc["JPN"])
print()
print(cars2.loc[["JPN"]])
print()
print(cars2.loc[["AUS", "EG"]])

In [None]:
# Subsets of cars2.
# Print out drives_right value of Morocco.
print(cars2.loc["MOR", "drives_right"])
print()
# Print sub-DataFrame.
print(cars2.loc[["RU", "MOR"], ["country", "drives_right"]])

In [None]:
# More subsets.
# Print out drives_right column as Series
print(cars2.loc[:, "drives_right"])
print()
# Print out drives_right column as DataFrame
print(cars2.loc[:, ["drives_right"]])
print()
# Print out cars_per_cap and drives_right as DataFrame
print(cars2.loc[:, ["cars_per_cap", "drives_right"]])

## Logic, Control Flow, and Filtering

### Comparison Operators

In [None]:
#  Compare NumPy arrays.
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
print(my_house >= 18)
print(my_house < your_house)

### Boolean Operators

The boolean operators are `and`, `or`, and `not`. They do not work on the boolean arrays produced by comparing NumPy arrays, because the truth value of a multi-element array is ambiguous. Use the `numpy.logical_and`, `numpy.logical_or`, and `numpy.logical_not` functions instead.

In [None]:
bmi = np.array([21.852, 21.75, 24.747, 21.441])
# The result of a boolean comparison is a numpy.ndarray object.
gt = bmi > 21
print(type(gt))
print(bmi > 21)
print(bmi < 22)
# This doesn't work:
# ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
# print(bmi > 21 and bmi < 22)

In [None]:
# Use numpy logical_and for pairwise comparison of elements of NumPy arrays.
print(np.logical_and(bmi > 21, bmi < 22))

In [None]:
print(bmi[np.logical_and(bmi > 21, bmi < 22)])
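
NumPy boolean arrays also support the element-wise operators `&`, `|`, and `~`. The course uses the `np.logical_*` functions, so treat this as an alternative sketch; the names `both`, `either`, and `negated` are illustrative.

```python
import numpy as np

bmi = np.array([21.852, 21.75, 24.747, 21.441])

# Parentheses are required because & and | bind more tightly than > and <.
both = (bmi > 21) & (bmi < 22)
either = (bmi < 21.5) | (bmi > 24)
negated = ~(bmi > 22)

print(both)
print(either)
print(negated)
# The operator form gives the same result as np.logical_and.
print(np.array_equal(both, np.logical_and(bmi > 21, bmi < 22)))
```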

#### Exercises

In [None]:
# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house > 18.5, my_house < 10))

# Both my_house and your_house smaller than 11
print(np.logical_and(my_house < 11, your_house < 11))

### if, elif, else

In [None]:
for z in range(1, 6):
    print("Checking " + str(z) + "...")
    if z % 2 == 0:
        print("z is divisible by 2")
    elif z % 3 == 0:
        print("z is divisible by 3")
    else:
        print("z is divisible by neither 2 nor 3")

#### Exercises

The exercises are trivial and not included here.

### Filtering Pandas DataFrames

In [None]:
# Find the countries in brics.csv where the area is greater than 8 million square km.
# brics_df2["area"], brics_df2.loc[:, "area"], and brics_df2.iloc[:, 2] all return the same Series.
# This finds the desired rows in the DataFrame.
is_huge = brics_df2.loc[:, "area"] > 8
is_huge
print(type(is_huge))

In [None]:
# Filter the data. Note that Jupyter/IPython formats the output nicely.
brics_df2[is_huge]

In [None]:
# Combine into a one-liner.
brics_df2[brics_df2.loc[:, "area"] > 8]

In [None]:
# More filtering as a one-liner using a boolean operator.
brics_df2[np.logical_and(brics_df2.loc[:, "area"] > 8, brics_df2.loc[:, "area"] < 10)]
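
The same range filter can be sketched with the `&` operator or with `Series.between`. The `demo` DataFrame below is a stand-in built from the `area` values in `brics_dict`, and `inclusive="neither"` is chosen so both bounds are excluded, matching `> 8` and `< 10`.

```python
import pandas as pd

# Stand-in for the BRICS data, limited to the columns needed here.
demo = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
     "area": [8.516, 17.10, 3.286, 9.597, 1.221]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Element-wise "and" on two boolean Series; the parentheses are required.
mask_operator = (demo["area"] > 8) & (demo["area"] < 10)
# Equivalent range test that excludes both bounds.
mask_between = demo["area"].between(8, 10, inclusive="neither")

filtered = demo[mask_operator]
print(filtered)
print(mask_operator.equals(mask_between))
```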

#### Exercises

In [None]:
# Filter the cars2 data.
# cars_per_cap is cars per 1000 people.
cars2

In [None]:
# Create the filter:
drives_right = cars2.loc[:, "drives_right"] == True
drives_right

In [None]:
# Apply the filter.
cars2[drives_right]

In [None]:
# Apply the filter the opposite way.
drives_left = cars2.loc[:, "drives_right"] == False
print(drives_left)
cars2[drives_left]

In [None]:
# One-liner for countries where drivers drive on the right.
cars2[cars2.loc[:, "drives_right"] == True]

In [None]:
# Filter the rows where cars_per_cap > 500.
cars2[cars2.loc[:, "cars_per_cap"] > 500]

In [None]:
# One-liner to filter the rows where cars_per_cap >= 100 and cars_per_cap <= 500.
cars2[np.logical_and(
    cars2.loc[:, "cars_per_cap"] >= 100,
    cars2.loc[:, "cars_per_cap"] <= 500)]

## Loops

### while

A `while` loop behaves like a repeated `if` statement. A `while` loop is best for repeating an action until a condition is met.
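
A minimal sketch of the idea: keep dividing an error value until it drops to 1 or below. The starting value of 50 and the divisor of 4 are arbitrary choices for illustration.

```python
error = 50.0
iterations = 0
# Repeat until the condition "error > 1" is no longer true.
while error > 1:
    error = error / 4
    iterations += 1
    print(error)
print("iterations: " + str(iterations))
```

Unlike a test such as `offset != 0`, a comparison like `error > 1` still ends the loop if the value skips past the target.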

#### Exercises

In [None]:
# A simple while loop.
offset = 8
print(offset)
while offset != 0:
    print("correcting...")
    offset = offset - 1
    print(offset)

In [None]:
# A simple while loop with more complicated testing.
offset = -6
while offset != 0:
    print("correcting...")
    if offset > 0:
        offset = offset - 1
    else:
        offset = offset + 1
    print(offset)

### for

In [None]:
# A simple for loop.
fam = [1.73, 1.68, 1.71, 1.89]
for height in fam:
    print(height)

In [None]:
# A for loop with enumeration.
for index, height in enumerate(fam):
    print("index " + str(index) + ": " + str(height))

In [None]:
# Use a for loop with a string.
for c in "family":
    print(c.capitalize())

#### Exercises

In [None]:
# A simple for loop.
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
for area in areas:
    print(area)

In [None]:
# A simple loop with enumeration.
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
for index, a in enumerate(areas):
    print("room " + str(index + 1) + ": " + str(a))
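
As an alternative to adding 1 to the index, `enumerate` accepts a `start` argument. This sketch collects the output lines in a list named `lines` (a name introduced here) so they can be inspected as well as printed.

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
lines = []
# start=1 makes the counter begin at 1 instead of 0.
for number, a in enumerate(areas, start=1):
    lines.append("room " + str(number) + ": " + str(a))
print("\n".join(lines))
```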

In [None]:
# A for loop with a list of lists.
house = [["hallway", 11.25], 
         ["kitchen", 18.0], 
         ["living room", 20.0], 
         ["bedroom", 10.75], 
         ["bathroom", 9.50]]
for room in house:
    print("the " + room[0] + " is " + str(room[1]) + " sqm")

### Loop Data Structures, Part 1

#### Iterate Through a Dictionary

In [None]:
# Iterate through a dictionary:
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}
for key, value in world.items():
    print(str(key) + ": " + str(value))

#### Iterate Through numpy Arrays

In [None]:
# Iterate through numpy arrays.
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
for val in bmi:
    print(val)

In [None]:
meas = np.array([np_height, np_weight])
print(meas)

In [None]:
# This prints each 1D array in the 2D array.
for val in meas:
    print(val)

In [None]:
# Iterate through the individual elements in the 2D array.
for val in np.nditer(meas):
    print(val)

#### Exercises

In [None]:
europe3 = {'spain':'madrid', 'france':'paris', 'germany':'berlin',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna'}
for key, val in europe3.items():
    print("the capital of " + key + " is " + val)

In [None]:
# The next exercise uses the data from baseball.csv (see Introduction to Python).
# Iterate over items in a numpy.ndarray containing one column:
# for x in np_array:
#     print(str(x) + " inches")

# Iterate over items in a numpy.ndarray containing two columns:
# for x in np.nditer(np_array):
#     print(x)

### Loop Data Structures, Part 2

In [None]:
# Review brics_df2 DataFrame.
brics_df2

#### Iterate Through a pandas DataFrame

In [None]:
# Looping through a pandas DataFrame.
# This prints the column names.
for val in brics_df2:
    print(val)

In [None]:
# Iterate through the column names:
for val in brics_df2.columns:
    print(val)

In [None]:
# Iterate through the row names:
for val in brics_df2.index:
    print(val)

In [None]:
# Use the iterrows method to iterate through the rows of a pandas DataFrame.
# This returns a row label and a pandas.Series object for each iteration.
for label, row in brics_df2.iterrows():
    print(label)
    print(type(row))
    print(row)
    print()

In [None]:
# Iterate through the DataFrame and print only the capitals.
for lab, row in brics_df2.iterrows():
    print(lab + ": " + row["capital"])

#### Add a Column to a DataFrame

In [None]:
# Add a column to the DataFrame containing the length of the name.
# The first assignment creates the column and fills the rows that
# have no value yet with NaN.
# The lengths are stored as floats because NaN is a float, so the
# column gets a float dtype while it is only partially filled.
# This is inefficient: iterrows builds a new Series for every row,
# and each .loc assignment updates one cell at a time.
for lab, row in brics_df2.iterrows():
    brics_df2.loc[lab, "name_length"] = int(len(row["country"]))
    # print("lab: " + lab + " row: " + str(row))
    # print(brics_df2)
    # print()
brics_df2

In [None]:
# Building the new column (Series) of the DataFrame can be performed
# more efficiently by using the apply method of the Series object.
# The iteration is within apply.
# The values of the new column (Series) are set to ints.
brics_df2["name_length2"] = brics_df2["country"].apply(len)
brics_df2
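
Another option, not covered in the course as far as these notes go, is the `.str` accessor, which applies string operations to a whole column at once. This sketch uses a stand-in `demo` DataFrame rather than `brics_df2`.

```python
import pandas as pd

demo = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]}
)

# .str.len() computes the length of every string in the column.
demo["name_length"] = demo["country"].str.len()
# .str.upper() upper-cases every string in the column.
demo["COUNTRY"] = demo["country"].str.upper()
print(demo)
```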

#### Exercises

In [None]:
# Review the cars2 DataFrame.
cars2

In [None]:
# Iterate through cars2 to print the label and data of each row.
for label, data in cars2.iterrows():
    print(label)
    print(data)

In [None]:
# Print the label and the cars_per_cap value for each row.
for lab, row in cars2.iterrows():
    print(lab + ": " + str(row["cars_per_cap"]))

In [None]:
# Inefficiently create a COUNTRY column in the DataFrame.
for lab, row in cars2.iterrows():
    cars2.loc[lab, "COUNTRY"] = row.loc["country"].upper()
cars2

In [None]:
# Use the apply method to efficiently create a new column.
cars2["COUNTRY2"] = cars2["country"].apply(str.upper)
cars2

## Hacker Statistics

This is a random walk simulation. You are climbing the steps inside the Empire State Building. You roll a die 100 times to determine your next move. Given the following rules, what is the probability that you'll reach Step 60?

- If you roll 1 or 2, you go down one step.
- If you roll 3, 4, or 5, you go up one step.
- If you roll 6, you roll the die again and climb the number of steps determined by the roll.
- There is a 0.1% chance (one in a thousand) that you'll fall down the steps, having to restart at the beginning.

This is "hacker statistics" because it takes a Monte Carlo approach to estimate the probability. Thousands of runs are generated, with success or failure recorded for each run, and the probability of success is estimated from the proportion of successes.
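
To illustrate the Monte Carlo idea on a problem with a known answer, the sketch below estimates the probability that two dice sum to 7 (exactly 1/6). It uses the standard library `random` module rather than NumPy, and the function name `estimate_probability_of_seven` is made up for this example.

```python
import random


def estimate_probability_of_seven(trials, seed=123):
    """Estimate P(two dice sum to 7) from the proportion of successes."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == 7:
            successes += 1
    return successes / trials


estimate = estimate_probability_of_seven(100_000)
print(estimate)  # should be close to 1/6
print(1 / 6)
```

More trials give a tighter estimate, which is why the staircase simulation uses thousands of runs.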

### Random Numbers

In [None]:
# Initialize the pseudorandom number generator and obtain the first two numbers
# from a uniform distribution.
np.random.seed(123)
print(np.random.rand())
print(np.random.rand())
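
NumPy also provides a newer generator interface, `np.random.default_rng`. The course uses `np.random.seed`, so this is only a sketch of the alternative. The seed 123 is reused from above, but the numbers it produces differ from those of the legacy functions.

```python
import numpy as np

rng = np.random.default_rng(123)
print(rng.random())        # uniform float in [0, 1)
print(rng.integers(1, 7))  # integer from 1 to 6; the upper bound is excluded

# Re-creating the generator with the same seed reproduces the sequence.
first = np.random.default_rng(123).random(3)
second = np.random.default_rng(123).random(3)
print(np.array_equal(first, second))

# Many die rolls at once.
rolls = rng.integers(1, 7, size=1000)
print(rolls.min(), rolls.max())
```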

### Preliminary Work

I wrote functions for rolling a die, falling, climbing or descending, and processing a climb event. I wrote functions for testing these functions.

In [None]:
# My work.
# Functions for rolling, climbing, falling, and running a climb event.
def roll_die():
    """
    Return a die roll (1, 2, 3, 4, 5, or 6).
    """
    # return int(np.random.rand() * 6 + 1)
    return np.random.randint(1, 7)

def fall():
    """
    The probability of falling is 0.1%. Return True if the
    random number indicates falling, False otherwise.
    """
    if np.random.rand() < 0.001:
        return True
    else:
        return False

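def climb():
    """
    Roll the die and return the change in step: -1 for a roll of
    1 or 2, +1 for a roll of 3, 4, or 5, and the value of a second
    roll for a 6.
    """
    steps = None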
    roll = roll_die()
    if roll == 1:
        steps = -1
    elif roll == 2:
        steps = -1
    elif roll == 3:
        steps = 1
    elif roll == 4:
        steps = 1
    elif roll == 5:
        steps = 1
    elif roll == 6:
        steps = roll_die()
    return steps

def do_climb():
    """
    Roll the die 100 times and return the final step reached.
    Make sure step is never less than 0.
    """
    iterations = 100
    step = 0
    for _ in range(iterations):
        if fall():
            step = 0
        else:
            # Climb or descend, but never drop below step 0.
            step = max(0, step + climb())
    return step

In [None]:
# My work.
# Functions for testing the above functions.
def test_roll_die():
    """
    Use 60,000 iterations to test the distribution of rolls
    generated by the roll_die function.
    """
    iterations = 60000
    rolls = {}
    for i in range(0, iterations):
        roll = roll_die()
        if roll not in rolls:
            rolls[roll] = 1
        else:
            rolls[roll] += 1
    print("rolls per " + str(iterations) + " iterations:")
    for roll in sorted(rolls.keys()):
        print(str(roll) + ": " + str(rolls[roll]))
    print()

def test_fall():
    """
    Print the number of falls per 1,000,000 events. The value
    should be near 1,000.
    """
    iterations = 1000000
    falls = 0
    for i in range(0, iterations):
        if fall():
            falls += 1
    print("falls per " + str(iterations) + " climbs:")
    print(falls)
    print()

def test_climb():
    """
    Call climb 12,000 times and record the distribution.
    """
    iterations = 12000
    steps_data = {}
    for i in range(0, iterations):
        steps = climb()
        if steps not in steps_data:
            steps_data[steps] = 1
        else:
            steps_data[steps] += 1
    print("steps per " + str(iterations) + " climbs:")
    for steps in sorted(steps_data.keys()):
        print(str(steps) + ": " + str(steps_data[steps]))
    print()

def test_do_climb():
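    """
    Run 10,000 climbs and print the proportion that end at
    step 60 or higher (success) and below step 60 (failure).
    Across all runs, expect about 0.001 * 10000 * 100 = 1000 falls.
    """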
    iterations = 10000
    result = {"success": 0, "failure": 0}
    for i in range(0, iterations):
        step = do_climb()
        if step >= 60:
            result["success"] += 1
        else:
            result["failure"] += 1
    print("proportions of success and failure for " + str(iterations) + " climbing runs:")
    print("success: " + str(result["success"] / iterations))
    print("failure: " + str(result["failure"] / iterations))
    print()

test_roll_die()
test_fall()
test_climb()
test_do_climb()

After watching more of the videos, I learned that the exercise requires collecting the step positions in a list and plotting them using plt.plot().

### Random Walk

The path of a molecule in air or a liquid is well described by a random walk. A random walk can also be used to simulate a gambler's financial status.
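
As a sketch of the gambler example, assume a made-up setup of 100 fair bets of one unit each, starting from a balance of 0. The running balance is then the cumulative sum of the individual wins and losses. The seed and the bet size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 100 fair bets, each winning or losing one unit.
bets = rng.choice([-1, 1], size=100)

# Running balance after each bet, with the starting balance of 0 prepended.
bankroll = np.concatenate(([0], np.cumsum(bets)))

print(bankroll[:10])
print(bankroll[-1])
```

Passing `bankroll` to plt.plot() would draw the walk in the same way as the step positions in the exercises that follow.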

#### Exercises

In [None]:
# This is a modified function that returns the random walk steps from a
# complete random walk so they can be plotted.
def do_random_walk():
    """
    Roll the die 100 times and return a list containing
    the step positions.
    """
    iterations = 100
    random_walk = [0]
    for _ in range(iterations):
        if fall():
            step = 0
        else:
            step = random_walk[-1] + climb()
            if step < 0:
                step = 0
        random_walk.append(step)
    return random_walk

def plot_random_walk(random_walk):
    plt.plot(random_walk)
    plt.show()

# Carry out a climbing run and plot it.
random_walk = do_random_walk()
plot_random_walk(random_walk)

### Distributions

What is the distribution of the number of tails seen if we toss a coin 10 times and repeat this 10,000 times?

In [None]:
final_tails = []
for i in range(10000):
    tails = [0]
    for j in range(10):
        coin = np.random.randint(0, 2)
        tails.append(tails[-1] + coin)
    final_tails.append(tails[-1])

# Collect the distribution in a dictionary and display it.
dist = {}
for result in final_tails:
    if result not in dist:
        dist[result] = 1
    else:
        dist[result] += 1
# Use a name other than bin, which would shadow the built-in function.
for n_tails in sorted(dist.keys()):
    print(str(n_tails) + ": " + str(dist[n_tails]))

# Plot the distribution as a histogram.

# This was the plot code from the video. The output did not
# match the histogram in the video.
# plt.hist(final_tails, bins=10)

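# I had to experiment with bins and align to get the histogram
# to draw correctly. bins=range(0, 12) gives one unit-wide bin
# per count from 0 to 10, and align="left" centers each bar on
# its count. I made other refinements to the figure.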
plt.hist(final_tails, bins=range(0, 12), align="left")
plt.title("Distribution of number of tails per 10 coin flips")
plt.xticks(range(0, 11))
plt.xlabel("Number of tails per 10 coin flips")
plt.ylabel("Observations")
plt.show()
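
For comparison with the simulated counts, the exact distribution is binomial: the probability of seeing k tails in 10 fair flips is C(10, k) / 2^10. The sketch below is my own addition and uses only the standard library. It prints each probability and the count it implies for 10,000 repetitions.

```python
from math import comb

flips = 10
runs = 10_000

# Exact probability of k tails in 10 fair coin flips: C(10, k) / 2**10.
expected = {k: comb(flips, k) / 2**flips for k in range(flips + 1)}

for k, probability in expected.items():
    print(f"{k}: {probability:.4f} (about {runs * probability:.0f} of {runs} runs)")
```

The simulated counts should scatter around these values, with the peak at 5 tails.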

#### Exercises

In [None]:
# Create multiple random walks.
# Plot the random walks.
# Create a histogram of the final positions of the walks.
rw_iterations = 1000
random_walks = []
for i in range(rw_iterations):
    random_walk = do_random_walk()
    random_walks.append(random_walk)
# Create a NumPy array with one row per walk. Plotting this array
# directly is not what we want, because plt.plot draws one line per column.
np_rw = np.array(random_walks)

# Transpose the data and plot it.
# plt.plot creates a line plot from the table columns.
# Here we have columns with 101 rows.
np_rw_t = np_rw.T
# print(np_rw_t.shape) # (101, 1000)

plt.plot(np_rw_t)
plt.show()

# Plot a histogram of the final position of each random walk.
ends = np_rw_t[-1]
max_range = int(((max(ends) / 10) + 1) * 10)
plt.xticks(range(0, max_range, 10))
plt.hist(ends, bins=range(0, max_range, 10))
plt.show()

# Find the number of successful walks.
success = 0
for x in ends:
    if x >= 60:
        success += 1
print(success)
print(success / rw_iterations)
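
The counting loop can also be written with a NumPy boolean mask. This is a sketch using a small stand-in array so that it runs on its own. In the notebook, `ends` would be the array of final positions. The standard error line is my own addition and assumes the walks are independent.

```python
import numpy as np

# Stand-in data for illustration only; in the notebook, ends holds the
# final step of each simulated walk.
ends = np.array([72, 58, 91, 60, 45, 83, 0, 66])

reached = ends >= 60           # boolean mask, one entry per walk
success = int(reached.sum())   # True counts as 1
proportion = reached.mean()    # fraction of walks that reached step 60

# Rough binomial standard error of the estimated proportion.
std_error = np.sqrt(proportion * (1 - proportion) / len(ends))

print(success)
print(proportion)
print(std_error)
```

With only 1,000 walks, the standard error gives a sense of how much the estimated probability may move from one notebook run to the next.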