# Building Fast Queries On a CSV

In this project, we are given a CSV file of laptop data with the following columns:

**ID:** A unique identifier for the laptop.  
**Company:** The name of the company that produces the laptop.  
**Product:** The name of the laptop.  
**TypeName:** The type of laptop.  
**Inches:** The size of the screen in inches.  
**ScreenResolution:** The resolution of the screen.  
**CPU:** The laptop CPU.  
**RAM:** The amount of RAM in the laptop.  
**Memory:** The size of the hard drive.  
**GPU:** The graphics card name.  
**OpSys:** The name of the operating system.  
**Weight:** The laptop weight.  
**Price:** The price of the laptop.  

We want to build a way to answer certain questions about our laptop inventory efficiently.

## Objectives
1. Analyze the time and space complexity of an algorithm.
2. Preprocess data to significantly speed up an algorithm.
3. Sort data and efficiently search sorted data.

## The Dataset

In [6]:
import csv
import time
import random

In [1]:
with open("laptops.csv") as file:
    file_contents = list(csv.reader(file))
    header = file_contents[0]
    rows = file_contents[1:]

print("Header is:", header)
print("\n")
print("First five rows:", rows[:5])

Header is: ['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


First five rows: [['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339'], ['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898'], ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575'], ['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537'], ['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel C

## Inventory Class

In [2]:
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            file_contents = list(csv.reader(file))
            
        self.header = file_contents[0]
        self.rows = file_contents[1:]
        
        for row in self.rows:
            row[-1] = int(row[-1])
            
# Testing Inventory Class
laptop_inventory = Inventory('laptops.csv')
print("Laptop Inventory Header:", laptop_inventory.header)
print("\n")
print("Number of rows in Inventory:", len(laptop_inventory.rows))

Laptop Inventory Header: ['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


Number of rows in Inventory: 1303


## Finding a Laptop From the Id

The first thing we will implement is a way to look up a laptop by its identifier. That way, when a customer comes to our store with a purchase slip, we can quickly identify the laptop it corresponds to.

In [3]:
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            file_contents = list(csv.reader(file))
            
        self.header = file_contents[0]
        self.rows = file_contents[1:]
        
        for row in self.rows:
            row[-1] = int(row[-1])
    
    def get_laptop_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
# Testing get_laptop_id function
laptop_inventory = Inventory('laptops.csv')
print(laptop_inventory.get_laptop_id('3362737')) # This should find a laptop
print(laptop_inventory.get_laptop_id('3362736')) # This should not find a laptop

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None
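
To see why this lookup takes time proportional to the number of rows, the following minimal sketch counts how many ID comparisons a linear scan performs. The helper `count_comparisons` and the toy rows are hypothetical, made up for illustration; they are not part of the Inventory class.

```python
def count_comparisons(rows, laptop_id):
    """Linear scan that also reports how many rows were examined."""
    comparisons = 0
    for row in rows:
        comparisons += 1
        if row[0] == laptop_id:
            return row, comparisons
    return None, comparisons

# Toy rows with made-up IDs, each shaped like [id, company]
toy_rows = [[str(i), "Brand"] for i in range(1000)]

print(count_comparisons(toy_rows, "0"))        # match at the first row
print(count_comparisons(toy_rows, "999"))      # match at the last row
print(count_comparisons(toy_rows, "missing"))  # no match: every row examined
```

In the worst case (a missing ID, or a match in the last row) the scan examines all R rows, so the lookup takes O(R) time while needing no extra memory.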


## Improving ID Lookups

The idea for this optimization is to preprocess the data into a dictionary whose keys are the IDs and whose values are the corresponding rows. Building the dictionary takes one pass over the rows and extra memory proportional to the number of rows, but it gives us constant-time lookups on average whenever we want to find the row with a given ID.

We will continue to copy and paste each iteration of the Inventory class so we can keep track of our changes.

In [5]:
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            file_contents = list(csv.reader(file))
            
        self.header = file_contents[0]
        self.rows = file_contents[1:]
        self.id_to_row = {}
        
        for row in self.rows:
            row[-1] = int(row[-1])
            self.id_to_row[row[0]] = row
            
    def get_laptop_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
# Testing get_laptop_from_id_fast function
laptop_inventory = Inventory('laptops.csv')
print(laptop_inventory.get_laptop_from_id_fast('3362737')) # This should find a laptop
print(laptop_inventory.get_laptop_from_id_fast('3362736')) # This should not find a laptop

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None
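
As a side note, the membership test followed by indexing can also be written with `dict.get`, which returns `None` by default when the key is absent. A minimal sketch on a made-up dictionary (the IDs, rows, and the function name `get_row_fast` are placeholders, not taken from laptops.csv or the class above):

```python
# Hypothetical index from ID to row, standing in for self.id_to_row
id_to_row = {
    "1000001": ["1000001", "BrandA", 500],
    "1000002": ["1000002", "BrandB", 750],
}

def get_row_fast(laptop_id):
    # dict.get returns None when the key is missing
    return id_to_row.get(laptop_id)

print(get_row_fast("1000002"))  # an ID that is present
print(get_row_fast("9999999"))  # an ID that is absent
```

Both forms behave the same for this use case; `dict.get` simply performs a single lookup instead of two.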


## Comparing the Performance

Let's compare the performance of both methods. We will generate some random IDs, then use each method to look up these same IDs.

In [8]:
ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]
laptop_inventory = Inventory('laptops.csv')
total_time_no_dict = 0
for laptop_id in ids:
    start = time.time()
    laptop_inventory.get_laptop_id(laptop_id)
    end = time.time()
    total_time_no_dict += end - start
    
total_time_dict = 0
for laptop_id in ids:
    start = time.time()
    laptop_inventory.get_laptop_from_id_fast(laptop_id)
    end = time.time()
    total_time_dict += end - start
    
print("Total time elapsed when using get_laptop_id:", total_time_no_dict)
print("Total time elapsed when using get_laptop_from_id_fast:", total_time_dict)

Total time elapsed when using get_laptop_id: 0.7592034339904785
Total time elapsed when using get_laptop_from_id_fast: 0.002805471420288086


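As we can see above, get_laptop_from_id_fast, which uses the data preprocessed into a dictionary, is about 271 times faster (0.759 s versus 0.0028 s for the same 10,000 lookups). Because almost none of the random IDs exist in the 1,303-row inventory, get_laptop_id scans every row on nearly every call, which is its worst case.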
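
One possible refinement of this measurement, shown as a sketch rather than a replacement for the cell above: `time.perf_counter` is a clock intended for measuring short durations, and timing the whole loop once avoids adding up thousands of very small intervals. The toy rows and the helper names `linear_lookup` and `time_lookups` are assumptions for illustration, and the printed numbers will vary by machine.

```python
import random
import time

# Toy inventory with made-up, unique 7-digit IDs
random.seed(0)
toy_ids = random.sample(range(1000000, 10000000), 1303)
toy_rows = [[str(i), "Brand"] for i in toy_ids]
toy_index = {row[0]: row for row in toy_rows}

def linear_lookup(laptop_id):
    """Scan the rows one by one, as get_laptop_id does."""
    for row in toy_rows:
        if row[0] == laptop_id:
            return row
    return None

def time_lookups(lookup, ids):
    """Return the elapsed seconds for calling lookup on every ID."""
    start = time.perf_counter()
    for laptop_id in ids:
        lookup(laptop_id)
    return time.perf_counter() - start

ids = [str(random.randint(1000000, 9999999)) for _ in range(2000)]
print("linear scan:", time_lookups(linear_lookup, ids))
print("dictionary: ", time_lookups(toy_index.get, ids))
```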

## Two Laptop Promotion

Let's say that there is a promotional gift card that allows a customer to buy up to two laptops. The catch is that the gift card can only be used once. Even if there is money left over after the purchase, the customer can no longer use the gift card.

So that the customer does not feel cheated, we want to check whether the gift card can be spent exactly: either one laptop costs exactly D dollars (the amount of money on the gift card), or two laptops have prices that add up to exactly D dollars.

We will implement this function now.

In [None]:
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            file_contents = list(csv.reader(file))
            
        self.header = file_contents[0]
        self.rows = file_contents[1:]
        self.id_to_row = {}
        
        for row in self.rows:
            row[-1] = int(row[-1])
            self.id_to_row[row[0]] = row
            
    def get_laptop_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
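    def check_promotion_dollars(self, dollars):
        # Case 1: a single laptop costs exactly the given amount
        for row in self.rows:
            if row[-1] == dollars:
                return True
        
        # Case 2: two laptop prices add up to exactly the given amount.
        # Note that a row can be paired with itself here.
        for row_outside in self.rows:
            for row_inside in self.rows:
                if row_outside[-1] + row_inside[-1] == dollars:
                    return True
        
        # No single price or pair of prices matches
        return False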
    
laptop_inventory = Inventory('laptops.csv')
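
To make the promotion rule concrete, here is a small sketch on made-up prices. It also shows one possible way to avoid the nested loop by storing the prices in a set; that is a swapped-in technique, not the method used in the class above. The function names and the price list are hypothetical.

```python
toy_prices = [300, 450, 700, 1200]

def check_promotion_brute_force(prices, dollars):
    """Same logic as check_promotion_dollars, applied to a plain list of prices."""
    for price in prices:
        if price == dollars:
            return True
    for price_outside in prices:
        for price_inside in prices:
            if price_outside + price_inside == dollars:
                return True
    return False

def check_promotion_with_set(prices, dollars):
    """Assumed alternative: look up the complementary price in a set."""
    price_set = set(prices)
    if dollars in price_set:
        return True
    for price in prices:
        if dollars - price in price_set:
            return True
    return False

for dollars in (450, 750, 600, 800):
    print(dollars,
          check_promotion_brute_force(toy_prices, dollars),
          check_promotion_with_set(toy_prices, dollars))
```

Here 450 matches a single price, 750 is 300 + 450, 600 is 300 + 300 (both versions allow the same price to be used twice, mirroring the nested loop above), and 800 cannot be formed. The brute-force version compares every pair of prices, while the set version does one average constant-time lookup per price at the cost of the extra memory for the set.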
