# Data Tooling 101 Sandbox

### **What the heck is a Jupyter Notebook?**
A Jupyter Notebook is an interactive coding environment that lets you split your code (and even Markdown like this!!!) into "cells". Cells can be executed individually, on demand, and in any order. Every cell runs against the same global namespace, so a variable created or updated in one cell is visible to every cell you run afterward, no matter where that cell sits on the page.

### **Example 1**: Executing Cells out of Order

In [45]:
# Execute this first
# Execute this third
try:
    print(an_uninstantiated_variable)
except NameError:
    print("Your variable is uninstantiated!")

Your variable is uninstantiated!


In [2]:
# Execute this second
an_uninstantiated_variable = "I'm instantiated now!"
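
To see why the order of execution matters, here is a rough model of the idea in plain Python. It is only a sketch: `kernel_globals`, `run_cell`, `cell_a`, and `cell_b` are hypothetical names made up for illustration, and a real Jupyter kernel does considerably more than this. The point is simply that every "cell" runs against the same dictionary of global names.

```python
# A hypothetical stand-in for a notebook kernel: every "cell" is a string of
# source code, and all cells run against the same globals dictionary.
kernel_globals = {}

cell_a = """
try:
    message = an_uninstantiated_variable
except NameError:
    message = "Your variable is uninstantiated!"
"""

cell_b = 'an_uninstantiated_variable = "I\'m instantiated now!"'


def run_cell(source):
    """Execute one cell in the shared namespace and return the latest message."""
    exec(source, kernel_globals)
    return kernel_globals.get("message")


first = run_cell(cell_a)   # the variable does not exist yet
run_cell(cell_b)           # the variable is created in the shared namespace
third = run_cell(cell_a)   # the very same cell now sees the variable

print(first)
print(third)
```

Running the same cell twice gives two different results because the shared namespace changed in between, which is exactly what happens when you execute notebook cells out of order.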

### **Exercise 1**

Using what you know about Jupyter Notebooks, write a faster version of the `get_fibonacci_term` function provided below and name it `get_fibonacci_term_faster`. Both functions MUST run under the `timeit` decorator, side by side, so that their print statements show up next to each other.

Here are some important limitations of this exercise:
* You can only modify the cell that is marked "Exercise 1 Solution"
* Both implementations must be called by the cell that is marked "Exercise 1 Validation"
* You can otherwise execute these cells in any order

In [51]:
def timeit(func):
    import time
    def wrapper(*args, **kwargs):
        before = time.time()
        result = func(*args, **kwargs)
        after = time.time()
        print(f"{func.__name__} took {after - before} seconds to finish.")
        return result
    return wrapper

@timeit
def get_fibonacci_term(n):
    def recursive(n):
        if n <= 2:
            return 1
        return recursive(n - 1) + recursive(n - 2)
    return recursive(n)

In [42]:
funcs = [get_fibonacci_term]

In [44]:
# Exercise 1 Validation
for func in funcs:
    print(func(30))

get_fibonacci_term took 0.16547536849975586 seconds to finish with the following arguments: (30,).
832040


In [37]:
# Exercise 1 Solution

### **Example 2**: Intro to NumPy and Pandas

NumPy (commonly aliased as `np`) and Pandas (commonly aliased as `pd`) are two of the most, if not the most, essential modules for doing data science work in Python. Consider learning them a hard requirement if you ever want to work with big data. That's NOT an exaggeration.

Python's built-in data structures map closely onto JavaScript's. JavaScript has strings, arrays, and objects; Python has strings, lists, and dictionaries. (There are many more parallels, but these are the first three people think of.) That said, Python is a high-level, interpreted language, and plain Python code won't be nearly as performant as Java, C, or C++ when crunching very large datasets.

To address this, NumPy and Pandas rely on compiled code under the hood, offloading the computational grunt work onto lower-level languages such as C. They also introduce three new and very important data structures that build on our mental model of lists and dictionaries:

1. array (NumPy): A NumPy array (`ndarray`) is a one- or multi-dimensional container whose elements all share the same data type. NumPy arrays are used for efficient storage, manipulation, and computation of large amounts of numerical data.

2. Series (Pandas): A Pandas Series is a one-dimensional labeled data structure provided by the pandas library in Python. It's similar to a column in a spreadsheet or a one-column database table. A Series consists of an ordered sequence of values and associated index labels, which can be used for efficient data manipulation and analysis.

3. DataFrame (Pandas): A Pandas DataFrame is a two-dimensional labeled data structure provided by the pandas library in Python. It's akin to a table in a relational database or a spreadsheet, where data is organized into rows and columns. DataFrames are versatile and widely used for data manipulation, analysis, and exploration.
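
As a quick illustration of the three structures above, here is a minimal sketch. The makes and prices are made-up sample values, and names such as `prices`, `price_by_make`, and `cars` are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

# array: a fixed-dtype, N-dimensional container
prices = np.array([10000, 15000, 20000])
grid = prices.reshape(3, 1)   # the same data arranged as 3 rows x 1 column
doubled = prices * 2          # arithmetic applies to every element at once

# Series: one-dimensional values plus index labels
price_by_make = pd.Series(prices, index=["toyota", "honda", "ford"], name="price")

# DataFrame: a table whose columns are Series sharing one index
cars = pd.DataFrame({
    "make": ["toyota", "honda", "ford"],
    "price": prices,
})

print(grid.shape)
print(doubled.tolist())
print(price_by_make["honda"])        # look a value up by its label
print(cars.shape)                    # (rows, columns)
print(type(cars["price"]).__name__)  # a single column comes back as a Series
```

Notice how the structures nest: a Series wraps array-like values with labels, and a DataFrame is a collection of Series that share one index.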

In [105]:
import random
import numpy as np
import pandas as pd

In [126]:
# Generating a large list vs. a large array of random numbers and multiplying each by 2.
l = [random.random()*100 for _ in range(int(1e7))]
a = np.array(l)

@timeit
def multiply_by_n(d, n):
    if isinstance(d, list):
        # Build the result right away; map() is lazy and would return before doing any work
        return [x * n for x in d]
    elif isinstance(d, np.ndarray):
        # np.array is a function; the array type itself is np.ndarray
        return d * n
    return d

# Exact timings depend on your machine; expect the NumPy array to finish far sooner than the list
for d in [l, a]:
    multiply_by_n(d, 2)


In [136]:
# Creating a DataFrame from a list of dictionaries
template = {
    'color': ['red','green','blue','orange','black','white','gray'],
    'car_type': ['compact','sedan','truck','suv','crossover','van','hatchback'],
    'fuel_type': ['gas','diesel','electric','hybrid','hydrogen'],
    'year': [2023,2022,2021,2020,2019,2018,2017],
    'price_range': [10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000, 50000],
    'condition': ['new', 'used'],
    'finance_options': ['cash','finance','lease'],
    'make': ['toyota','honda','ford','subaru','chevrolet','kia','gm']
}

dataset = []
for _ in range(int(1e4)):
    # Build one row by picking a random option for every field
    d = {key: random.choice(options) for key, options in template.items()}
    dataset.append(d)

df = pd.DataFrame(dataset)

# head(n) gives us the first n rows of the dataframe
df.head(10)

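|   | color | car_type | fuel_type | year | price_range | condition | finance_options | make |
|---|-------|----------|-----------|------|-------------|-----------|-----------------|------|
| 0 | blue | truck | gas | 2022 | 35000 | used | lease | honda |
| 1 | orange | truck | electric | 2018 | 10000 | new | lease | kia |
| 2 | white | suv | diesel | 2023 | 20000 | new | cash | gm |
| 3 | white | compact | electric | 2020 | 20000 | new | finance | gm |
| 4 | green | crossover | hybrid | 2017 | 25000 | new | finance | honda |
| 5 | green | truck | electric | 2017 | 20000 | new | finance | kia |
| 6 | white | crossover | hybrid | 2023 | 45000 | used | cash | toyota |
| 7 | green | hatchback | gas | 2020 | 15000 | new | cash | ford |
| 8 | white | crossover | diesel | 2023 | 20000 | used | cash | chevrolet |
| 9 | gray | hatchback | gas | 2022 | 25000 | used | cash | subaru |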


In [134]:
# A series would represent a single column or row of a dataframe
df.car_type

0         compact
1       hatchback
2       hatchback
3       crossover
4           sedan
          ...    
9995        sedan
9996          van
9997        truck
9998    hatchback
9999      compact
Name: car_type, Length: 10000, dtype: object
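
To make the "column or row" point concrete, here is a small sketch. It uses a hypothetical three-row `small_df` with made-up values rather than the larger `df`, so the output stays easy to read.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the larger `df`
small_df = pd.DataFrame({
    "car_type": ["compact", "sedan", "truck"],
    "year": [2023, 2021, 2019],
})

column = small_df["car_type"]   # same result as small_df.car_type
row = small_df.iloc[0]          # first row, selected by position

print(type(column).__name__, type(row).__name__)
print(column.index.tolist())    # a column is indexed by the row labels
print(row.index.tolist())       # a row is indexed by the column names
```

Both selections come back as a Series; the difference is what the index holds. Bracket access (`small_df["car_type"]`) is generally the safer habit, since attribute access stops working when a column name contains spaces or collides with an existing DataFrame attribute.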