# Data Tooling 101 Sandbox

### **What the heck is a Jupyter Notebook?**
A Jupyter Notebook is an interactive code environment that allows you to compartmentalize your code (and even markdown like this!!!) within the context of "cells". Code can be separated and executed in any order, on demand, within a specific cell's scope, but whatever gets executed affects a global scope, allowing you to create new global variables and update them in any order you choose.

### **Example 1**: Executing Cells out of Order

In [45]:
# Execute this first
# Execute this third
try:
    print(an_uninstantiated_variable)
except:
    print("Your variable is uninstantiated!")

Your variable is uninstantiated!


In [2]:
# Execute this second
an_uninstantiated_variable = "I'm instantiated now!"

### **Exercise 1**

Using what you know about Jupyter Notebooks, create a better implementation of the `get_fibonacci_term` implementation provided below as `get_fibonacci_term_faster`. Both functions MUST execute using the timeit decorator, side-by-side, so that the print statements show up next to each other.

Here are some important limitations of this exercise:
* You can only modify the cell that is marked "Exercise 1 Solution"
* Both implementations must be called by the cell that is marked "Exercise 1 Validation"
* You can otherwise execute these cells in any order

In [51]:
def timeit(func):
    import time
    import traceback
    def wrapper(*args, **kwargs):
        before = time.time()
        result = func(*args, **kwargs)
        after = time.time()
        print(f"{func.__name__} took {after - before} seconds to finish.")
        return result
    return wrapper

@timeit
def get_fibonacci_term(n):
    def recursive(n):
        if n <= 2:
            return 1
        return recursive(n - 1) + recursive(n - 2)
    return recursive(n)

In [42]:
funcs = [get_fibonacci_term]

In [44]:
# Exercise 1 Validation
for func in funcs:
    print(func(30))

get_fibonacci_term took 0.16547536849975586 seconds to finish with the following arguments: (30,).
832040


In [37]:
# Exercise 1 Solution

### **Example 2**: Intro to NumPy and Pandas

NumPy (commonly aliased as `np`) and Pandas (commonly aliased as `pd`) are two of the most, if not the most, essential modules for doing data science work in Python. Consider learning them a hard requirement if you ever want to work with big data. That's NOT an exaggeration.

Python has many optimizations that make it much more performant than Javascript when it comes to array-like data structures. Javascript has strings, arrays, and objects. Similarly, Python has strings, lists, and dictionaries. (There are many more examples but these are the first three comparisons people think of). That being said, Python is a higher-level language and won't be nearly as performant as a Java, C, or C++ when it comes to raw optimization of very large datasets.

To address this, NumPy and Pandas leverage C and C++ under the hood in order to offload the computational grunt work onto those lower level languages. They also introduce three new and very important data structures that build off of our mental model of lists and dictionaries:

1. array (NumPy): A NumPy array is a single-to-multi dimensional container that holds elements of the same data type. NumPy arrays are used for efficient storage, manipulation, and computation of large amounts of numerical data.

2. Series (Pandas): A Pandas Series is a one-dimensional labeled data structure provided by the pandas library in Python. It's similar to a column in a spreadsheet or a one-column database table. A Series consists of an ordered sequence of values and associated index labels, which can be used for efficient data manipulation and analysis.

3. DataFrame (Pandas): A Pandas DataFrame is a two-dimensional labeled data structure provided by the pandas library in Python. It's akin to a table in a relational database or a spreadsheet, where data is organized into rows and columns. DataFrames are versatile and widely used for data manipulation, analysis, and exploration.

In [2]:
import random
import numpy as np
import pandas as pd

In [126]:
# Generating a large list v. a large array of random numbers and multipling it by 2.
l = [random.random()*100 for _ in range(int(1e7))]
a = np.array(l)

@timeit
def multiply_by_n(d,n):
    if type(d) == list:
        return map(lambda x: x * 2, d)
    elif type(d) == np.array:
        return d * n
    return d
    
for d in [l,a]:
    multiply_by_n(d, 2)

multiply_by_n took 3.5762786865234375e-06 seconds to finish.
multiply_by_n took 1.6689300537109375e-06 seconds to finish.


In [9]:
# Creating a dataframe from a list of objects
template = {
    'color': ['red','green','blue','orange','black','white','gray'],
    'car_type': ['compact','sedan','truck','suv','crossover','van','hatchback'],
    'fuel_type': ['gas','diesel','electric','hybrid','hydrogen'],
    'year': [2023,2022,2021,2020,2019,2018,2017],
    'price_range': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000],
    'condition': ['new', 'used'],
    'finance_options': ['cash','finance','lease'],
    'make': ['toyota','honda','ford','subaru','chevrolet','kia','gm']
}

dataset = []
for _ in range(int(1e4)):
    d = {}
    for key in template.keys():
        d[key] = template[key][random.randint(0,len(template[key])-1)]
    dataset.append(d)

df = pd.DataFrame(dataset)

# head(n) gives us the first n rows of the dataframe
df.head(10)

Unnamed: 0,color,car_type,fuel_type,year,price_range,condition,finance_options,make
0,orange,truck,hybrid,2018,10000,used,lease,gm
1,green,hatchback,electric,2023,15000,used,lease,gm
2,orange,suv,electric,2017,50000,used,finance,chevrolet
3,orange,suv,hydrogen,2020,45000,used,finance,chevrolet
4,white,sedan,electric,2017,50000,used,finance,gm
5,blue,compact,hydrogen,2017,15000,used,finance,honda
6,black,sedan,gas,2021,10000,used,finance,gm
7,black,crossover,diesel,2018,20000,new,cash,ford
8,white,sedan,gas,2019,30000,new,finance,kia
9,black,truck,gas,2018,40000,used,finance,gm


In [134]:
# A series would represent a single column or row of a dataframe
df.car_type

0         compact
1       hatchback
2       hatchback
3       crossover
4           sedan
          ...    
9995        sedan
9996          van
9997        truck
9998    hatchback
9999      compact
Name: car_type, Length: 10000, dtype: object

### **Example 3**: Applying ETL Processes to a CSV

In the data engineering world, a key acronym that describes the key components of data pipelining is ETL, or Extract, Transform, and Load.

1. Extract: Grabbing data from one or more sources of data (typically but not always external) to serve as the raw base data for you to change as part of a data pipeline.

2. Transform: How are we going to combine and process the data we've extracted so that it provides new value to our stakeholders? Common transformations might include filtering, merging, aggregating, and labeling, among others.

3. Load: Taking the data that we extracted and/or transformed and delivering it to some internal resource we own, that can be accessed by key stakeholders within our team, department, or organization.

ETL processes can be ad-hoc or periodic. They can be manually executed or scheduled. They could involve local machines, bare metal servers, or cloud-based resources. They can be in the form of databases, CSVs, or flat files.

In [11]:
# Let's extract a sample dataset of cars data saved to this repo.
# We could generate this dynamically, but this static file will
# ensure that we get reproducible results.

path = '../data/sample_cars.csv'
cars = pd.read_csv(path)
cars.head(10)

Unnamed: 0,color,car_type,fuel_type,year,price_range,condition,finance_options,make
0,orange,truck,hybrid,2018,10000,used,lease,gm
1,green,hatchback,electric,2023,15000,used,lease,gm
2,orange,suv,electric,2017,50000,used,finance,chevrolet
3,orange,suv,hydrogen,2020,45000,used,finance,chevrolet
4,white,sedan,electric,2017,50000,used,finance,gm
5,blue,compact,hydrogen,2017,15000,used,finance,honda
6,black,sedan,gas,2021,10000,used,finance,gm
7,black,crossover,diesel,2018,20000,new,cash,ford
8,white,sedan,gas,2019,30000,new,finance,kia
9,black,truck,gas,2018,40000,used,finance,gm


In [12]:
# Filtering using masks
# If we want to have multiple filters applied as AND, use & between them
# If we want to have multiple filters applied as OR, use | between them
is_truck = cars.car_type == 'truck'
cars[is_truck]

Unnamed: 0,color,car_type,fuel_type,year,price_range,condition,finance_options,make
0,orange,truck,hybrid,2018,10000,used,lease,gm
9,black,truck,gas,2018,40000,used,finance,gm
13,orange,truck,electric,2018,10000,used,finance,kia
31,green,truck,gas,2021,25000,used,cash,gm
39,blue,truck,diesel,2020,45000,new,cash,subaru
...,...,...,...,...,...,...,...,...
9965,blue,truck,hydrogen,2017,35000,new,cash,honda
9976,blue,truck,electric,2017,35000,used,lease,ford
9987,green,truck,diesel,2023,10000,new,lease,kia
9988,black,truck,hydrogen,2023,50000,used,finance,gm


### **Exercise 2**
Using what we learned about filters and masks, can you filter this dataset to only include cars that are under $20,000 or otherwise can be financed (not leased)?

In [22]:
# Exercise 2 Solution

In [28]:
# Aggregations and stats
cars.groupby("color").count()

Unnamed: 0_level_0,car_type,fuel_type,year,price_range,condition,finance_options,make
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
black,1484,1484,1484,1484,1484,1484,1484
blue,1388,1388,1388,1388,1388,1388,1388
gray,1376,1376,1376,1376,1376,1376,1376
green,1441,1441,1441,1441,1441,1441,1441
orange,1394,1394,1394,1394,1394,1394,1394
red,1447,1447,1447,1447,1447,1447,1447
white,1470,1470,1470,1470,1470,1470,1470
