# Building Fast Queries on a CSV

In this project, we'll imagine that we own an e-commerce store that sells laptops. Our goal is to create a class that represents our inventory, with methods that implement queries to answer business questions about it. Along the way, we'll focus on the time and space complexity of algorithms, preprocessing data to speed up queries, and efficiently sorting and searching data.

## The Dataset

We'll work with a CSV file that comes from the [Laptop Prices](https://www.kaggle.com/ionaskel/laptop-prices) dataset on Kaggle. In our `laptops.csv` file, the column IDs have been changed, and the prices have been converted to integers. Let's start by reading in our data, separating the header from the rows, and doing some initial exploration of the dataset.

In [1]:
import csv

with open('laptops.csv') as file:
    reader = csv.reader(file)
    data = list(reader)
    header = data[0]
    rows = data[1:]
    
print(header)
for i in range(5):
    print(rows[i])

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339']
['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898']
['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575']
['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537']
['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256GB SSD',

## Inventory Class

Next, we'll start constructing our inventory class. Its constructor takes the name of the CSV file as an argument and reads the inventory into `self.header` and `self.rows`.

In [2]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]
        for row in self.rows:         # Convert price to int
            row[-1] = int(row[-1])

In [3]:
# Testing our new class
inventory = Inventory('laptops.csv')
print(inventory.header)
print(len(inventory.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
1303


## Finding a Laptop From the ID

Throughout this project, we will be making various improvements to our `Inventory()` class, so we'll just copy the latest version of the class into a new cell and make improvements to it there. This way we can more easily keep track of the changes made.

The next method we'll create will be a search function that will take the laptop ID as an argument and return the entire row for that laptop. This way we can identify the laptop that corresponds to a customer's purchase slip.

In [4]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]
        for row in self.rows: 
            row[-1] = int(row[-1])
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None

In [5]:
# Testing our new method
inventory = Inventory('laptops.csv')
print(inventory.get_laptop_from_id('3362737')) # Found in the dataset
print(inventory.get_laptop_from_id('3362736')) # Not actually in the dataset

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None


## Improving ID Lookups

The algorithm above may have to look at every single row to find the one we're looking for, so it has a time complexity of *O(Number of Rows)*. We can improve this by preprocessing the data into a dictionary that maps each laptop ID to its row, instead of scanning the list of rows on every lookup.

In [6]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]
        for row in self.rows: 
            row[-1] = int(row[-1])
        # Create the dictionary in the __init__() method
        self.id_to_row = {} 
        for row in self.rows:
            self.id_to_row[row[0]] = row # Use row ID as the key and entire row as the value
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    # Implement a new get_laptop_from_id() method so we can compare the two
    def get_laptop_from_id_fast(self, laptop_id): 
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None

In [7]:
# Testing our newest method
inventory = Inventory('laptops.csv')
print(inventory.get_laptop_from_id_fast('3362737')) # Found in the dataset
print(inventory.get_laptop_from_id_fast('3362736')) # Not actually in the dataset

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None
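
As a side note, the same index can be built and queried more compactly. The following is a minimal sketch on hypothetical toy rows (the names `toy_rows`, `toy_id_to_row`, and `lookup` are illustrative and not part of the `Inventory` class). It assumes a dict comprehension and `dict.get()` are acceptable substitutes for the explicit loop and membership test used above.

```python
# Toy rows standing in for inventory.rows (hypothetical data, not from laptops.csv)
toy_rows = [
    ['1001', 'BrandA', 499],
    ['1002', 'BrandB', 899],
]

# Build the ID index in a single pass with a dict comprehension
toy_id_to_row = {row[0]: row for row in toy_rows}

def lookup(laptop_id):
    # dict.get() returns None when the key is missing,
    # so no separate membership test is needed
    return toy_id_to_row.get(laptop_id)

print(lookup('1002'))  # an ID that exists in the toy data
print(lookup('9999'))  # an ID that does not exist
```

Both forms behave the same for these lookups, so the choice is mostly a matter of readability.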


## Comparing Performance

The `get_laptop_from_id_fast()` implementation has an average time complexity of *O(1)* per lookup. The trade-off is that we spend time building the dictionary up front and use more memory to store it.

Let's compare the performance of these two methods. We'll generate random IDs, look up the same IDs with both methods, and use the `time` module to measure the execution times.

In [8]:
import time
import random

random_ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]

inventory = Inventory('laptops.csv')

elapsed_time_no_dict = 0
for id_number in random_ids:
    start = time.time()
    inventory.get_laptop_from_id(id_number)
    end = time.time()
    elapsed_time_no_dict += end - start
    
elapsed_time_dict = 0
for id_number in random_ids:
    start = time.time()
    inventory.get_laptop_from_id_fast(id_number)
    end = time.time()
    elapsed_time_dict += end - start
    
print(elapsed_time_no_dict)
print(elapsed_time_dict)

0.6787481307983398
0.002396821975708008


There's a significant difference in performance between our two methods. In this run, the dictionary method was roughly 280 times faster. Our dataset has only about 1,300 rows, so the absolute difference is small here. However, if we had millions of rows, or performed lots of queries, this difference would add up and could become a bottleneck in our application.
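
Calling `time.time()` around every single lookup adds timer overhead to each measurement. As a hedged alternative, the sketch below times a whole batch of lookups at once with `time.perf_counter()`, which is intended for measuring short durations. It runs on synthetic IDs rather than `laptops.csv`, and the helper names (`linear_lookup`, `dict_lookup`, `time_lookups`) are assumptions made for illustration.

```python
import random
import time

# Synthetic stand-in for the inventory: 1,300 unique string IDs (hypothetical data)
ids = [str(i) for i in random.sample(range(1000000, 10000000), 1300)]
rows = [[laptop_id, 'placeholder'] for laptop_id in ids]
id_to_row = {row[0]: row for row in rows}

def linear_lookup(laptop_id):
    # Scan every row until the ID matches
    for row in rows:
        if row[0] == laptop_id:
            return row
    return None

def dict_lookup(laptop_id):
    # Single hash lookup in the preprocessed index
    return id_to_row.get(laptop_id)

def time_lookups(lookup, queries):
    # Time the whole batch so the timer is read only twice
    start = time.perf_counter()
    for query in queries:
        lookup(query)
    return time.perf_counter() - start

queries = [str(random.randint(1000000, 9999999)) for _ in range(2000)]
linear_time = time_lookups(linear_lookup, queries)
dict_time = time_lookups(dict_lookup, queries)
print(linear_time, dict_time)
```

The absolute numbers depend on the machine, so only the ratio between the two timings is meaningful.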

## Two Laptop Promotion

Let's imagine our store offers promotional, single-use gift cards that can be spent on at most two laptops. We don't want customers to feel cheated because they can't spend the full balance in a single transaction, so we want to make sure that there is at least one way to spend it in full.

To do this, we'll write a method called `check_promotion_dollars()` that checks whether it's possible to spend a given amount exactly by purchasing one or two laptops.

In [9]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]
        for row in self.rows: 
            row[-1] = int(row[-1])
        self.id_to_row = {} 
        for row in self.rows:
            self.id_to_row[row[0]] = row
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id): 
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
    def check_promotion_dollars(self, dollars):
        for row in self.rows:     # Check if there's one laptop that matches the gift card exactly
            if row[-1] == dollars:
                return True
        for row1 in self.rows:    # Iterate over all pairs of rows (a row may pair with itself) to check if a combined price matches the gift card
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:
                    return True
        return False

In [10]:
# Testing the new method
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars(1000))
print(inventory.check_promotion_dollars(442))

True
False


## Optimizing Two Laptop Promotion


In [11]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]
        for row in self.rows: 
            row[-1] = int(row[-1])
        self.id_to_row = {} 
        for row in self.rows:
            self.id_to_row[row[0]] = row
        self.prices = set()        # Assign empty set
        for row in self.rows:      # Add the price in each row to the set
            self.prices.add(row[-1])
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id): 
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars: 
                return True
        for row1 in self.rows:
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False

In [12]:
# Testing the faster method
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars_fast(1000))
print(inventory.check_promotion_dollars_fast(442))

True
False
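
The idea behind `check_promotion_dollars_fast()` can be sketched on a tiny example. For each price, the only partner price that works is `dollars - price`, and checking whether that value is in a set takes constant time on average, so a single pass over the prices replaces the nested loops. The toy prices and the function name `can_spend_exactly` below are hypothetical and only mirror the logic of the method above.

```python
# Hypothetical toy prices, not taken from laptops.csv
toy_prices = {300, 450, 700}

def can_spend_exactly(dollars, prices):
    # Case 1: a single laptop matches the balance exactly
    if dollars in prices:
        return True
    # Case 2: for each price, the partner price is fixed at dollars - price
    for price in prices:
        if dollars - price in prices:
            return True
    return False

print(can_spend_exactly(450, toy_prices))   # one laptop at 450
print(can_spend_exactly(1000, toy_prices))  # 300 + 700
print(can_spend_exactly(600, toy_prices))   # 300 + 300, the same price used twice
print(can_spend_exactly(500, toy_prices))   # no combination adds up
```

Note that, like the nested-loop version, this treats the same price used twice as a valid pair.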


## Comparing Promotion Method Performance

Let's check the difference in performance between these two variations that we just created.

In [13]:
random_prices = [random.randint(100, 5000) for _ in range(100)]

inventory = Inventory('laptops.csv')

elapsed_time_no_set = 0
for price in random_prices:
    start = time.time()
    inventory.check_promotion_dollars(price)
    end = time.time()
    elapsed_time_no_set += end - start

elapsed_time_set = 0
for price in random_prices:
    start = time.time()
    inventory.check_promotion_dollars_fast(price)
    end = time.time()
    elapsed_time_set += end - start
    
print(elapsed_time_no_set)
print(elapsed_time_set)

0.5359442234039307
0.00026798248291015625


## Finding Laptops Within a Budget

In [14]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
        self.header = rows[0]
        self.rows = rows[1:]
        for row in self.rows: 
            row[-1] = int(row[-1])
        self.id_to_row = {} 
        for row in self.rows:
            self.id_to_row[row[0]] = row
        self.prices = set()        
        for row in self.rows:      
            self.prices.add(row[-1])
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id): 
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars: 
                return True
        for row1 in self.rows:
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False
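
The class above does not yet contain a budget query. One possible approach, offered only as a sketch and not as the method this project goes on to define, is to sort the prices once and then use binary search to count how many fall within a budget. The example uses the prices of the first four rows shown earlier; the names `sorted_prices` and `count_within_budget` are hypothetical.

```python
import bisect

# Prices of the first four rows printed at the start of the project
toy_prices = [1339, 898, 575, 2537]

# Preprocess once: sort the prices so binary search applies
sorted_prices = sorted(toy_prices)

def count_within_budget(budget):
    # bisect_right returns the insertion point after any entries equal to budget,
    # which equals the number of prices <= budget
    return bisect.bisect_right(sorted_prices, budget)

print(count_within_budget(1000))  # prices 575 and 898 fit
print(count_within_budget(500))   # nothing fits
```

Sorting costs *O(N log N)* once, after which each query takes *O(log N)* instead of a full scan.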