# Building Fast Queries on a CSV

For this project, we will imagine that we own an online laptop store and want to build a way to answer a few different business questions about our inventory.

The dataset is found [here](https://www.kaggle.com/ionaskel/laptop-prices) with some edits to IDs and integer prices.

In [63]:
import csv

with open('laptops.csv') as f:
    reader = csv.reader(f)
    rows = list(reader)
    header = rows[0]
    rows = rows[1:]
    
print(header)
print('------')

for i in range(5):
    print(rows[i])
    print('------')

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
------
['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339']
------
['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898']
------
['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575']
------
['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537']
------
['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel C

## Inventory Class

The goal of this project is to create a class that represents our inventory. The methods in that class will implement the queries that we want to answer about our inventory. We will also pre-process that data to make those queries run faster.

Here are some queries that we will want to answer:
* Given a laptop id, find the corresponding data.
* Given an amount of money, find whether there are two laptops whose total price is that given amount.
* Identify all laptops whose price falls within a given budget.

In [64]:
class Inventory():

    def __init__(self, csv_filename: str) -> None:
        '''
        Inventory should be initiated with a string for a csv file in the working directory
        '''
        
        # Open the file and generate the data
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            
        # Cast the price to an integer
        for row in self.rows:
            row[-1] = int(row[-1])
            
inventory = Inventory('laptops.csv')
print(inventory.header)
print('-----')
print(len(inventory.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
-----
1303


## Finding a Laptop from the ID

The first thing that we will implement is a way to look up a laptop from a given identifier. In this way, when a customer comes to our store with a purchase slip, we can quickly identify the laptop to which it corresponds.

This part will satisfy this query: Given a laptop id, find the corresponding data.

In [65]:
# To extend the functionality, we can make the class a subclass of itself
class Inventory(Inventory):
            
    def get_laptop_from_id(self, laptop_id: str) -> list[any] | None:
        '''
        Returns the row with the target laptop_id in element 0
        '''
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
inventory = Inventory('laptops.csv')
print(inventory.get_laptop_from_id('3362737'))
print('-----')
print(inventory.get_laptop_from_id('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
-----
None


## Improving Id Lookups

In our dataset, we only have about 1,300 laptops, so it might seem unnecessary to improve the performance of this query. However, you have to imagine that this code could be used in situations where the inventory contains millions of rows. Also, if we perform a lot of queries, even on a small dataset, the slow query performance will start to add up. It might eventually become the bottleneck of the application.

Here we look to improve the performance.

In [66]:
class Inventory(Inventory):

    def __init__(self, csv_filename):
        
        # Open the file and generate the data
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            
        # Cast the price to an integer
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # Creates a dictionary to populate with the ids and rows
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
    
    def get_laptop_from_id_fast(self, laptop_id: str) -> list[any] | None:
        '''
        Same as get_laptop_from_id except using a dictionary created in the instance
        '''
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    

inventory = Inventory('laptops.csv')
print(inventory.get_laptop_from_id_fast('3362737'))
print('-----')
print(inventory.get_laptop_from_id_fast('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
-----
None


## Comparing the Performance

Let's experiment to compare the performance of the two methods. The idea is to generate random IDs using the random module. Then, use both methods to lookup these same IDs and compare using the time module.

In [67]:
import time
import random

id_values = [str(random.randint(1000000, 9999999)) for _ in range(10000)]

# Method timing with no dictionary
total_time_no_dict = 0
for id in id_values:
    start = time.time()
    inventory.get_laptop_from_id(id)
    end = time.time()
    total_time_no_dict += end - start
    
# Method timing with a dictionary
total_time_dict = 0
for id in id_values:
    start = time.time()
    inventory.get_laptop_from_id_fast(id)
    end = time.time()
    total_time_dict += end - start
    
print('The amount of time with no dictionary is: ' + str(total_time_no_dict))
print('The amount of time with a dictionary is: ' + str(total_time_dict))
print('That is a difference of a factor of: ' + str(total_time_no_dict/total_time_dict))

The amount of time with no dictionary is: 0.9747114181518555
The amount of time with a dictionary is: 0.003222227096557617
That is a difference of a factor of: 302.4961894191639


## Two Laptop Promotion

Sometimes, your store offers a promotion where you give a gift card. A customer can use the gift to buy up to two laptops. To avoid having to keep track of what was already spent, the gift card has a single time usage. This means that, even if there is leftover money, it cannot be used anymore.

We will add a function that, given a dollar amount, checks whether it is possible to spend precisely that amount by purchasing up to two laptops. This part will satisfy this query: Given an amount of money, find whether there are two laptops whose total price is that given amount.

In [68]:
class Inventory(Inventory):
 
    def check_promotion_dollars(self, dollars: int) -> bool:    
        '''
        Takes as input the number of dollars and returns a boolean if one or two laptops totals that amount exactly
        '''
        
        # if one laptop fits the criteria
        for row in self.rows:
            if row[-1] == dollars:
                return True
        
        # if two laptops fit the criteria
        for row1 in self.rows:
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:
                    return True
        
        return False
    
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars(1000))
print(inventory.check_promotion_dollars(442))

True
False


# Optimizing Laptop Promotion

Since we only care about whether or not there is a solution, we can store all laptops prices in a set when we initialize the inventory. Then we can check in constant time whether there is a laptop with a given price.

In [69]:
class Inventory(Inventory):

    def __init__(self, csv_filename):
        
        # Open the file and generate the data
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            
        # Cast the price to an integer
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # Creates a dictionary to populate with the ids and rows
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
            
        # Creates a set to populate the prices
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
    
    def check_promotion_dollars_fast(self, dollars: int) -> bool:    
        '''
        Same as check_promotion_dollars but uses the set created in the instance
        '''
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollards - price in self.prices:
                return True
        
        return False
    
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars(1000))
print(inventory.check_promotion_dollars(442))

True
False


## Comparing Promotion Functions

Let's experiment to compare the performance of the two methods. The idea is to generate random prices using the random module. Then, use both methods to lookup these same prices and compare using the time module.

In [70]:
prices = [random.randint(100, 5000) for _ in range(100)]

inventory = Inventory('laptops.csv')

# Method with no set
total_time_no_set = 0
for price in prices:
    start = time.time()
    inventory.get_laptop_from_id(price)
    end = time.time()
    total_time_no_set += end - start
    
# Method with set
total_time_set = 0
for price in prices:
    start = time.time()
    inventory.get_laptop_from_id_fast(price)
    end = time.time()
    total_time_set += end - start
    
print('The amount of time with no set is: ' + str(total_time_no_set))
print('The amount of time with a set is: ' + str(total_time_set))
print('That is a difference of a factor of: ' + str(total_time_no_set/total_time_set))

The amount of time with no set is: 0.016696453094482422
The amount of time with a set is: 3.409385681152344e-05
That is a difference of a factor of: 489.72027972027973


## Finding Laptops Within a Budget

We want to write a method that efficiently answers the query: Given a budget of D dollars, find all laptops whose price it at most D.

If we sort all laptops by price, we can use binary search to identify the first laptop in the sorted list with a price larger than D. We need to make sure that our binary search finds the first one on the list. Then, the result of the query will consist of all laptops whose index in the sorted list is smaller than the index of the first laptop whose price is higher than D dollars.

This will answer the final of the initial set of queries: Identify all laptops whose price falls within a given budget.

### Laptops Affordable in Budget

First, we'll write a function that will answer which laptops are in budget.

In [71]:
class Inventory(Inventory):

    def __init__(self, csv_filename):
        
        # Open the file and generate the data
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            
        # Cast the price to an integer
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # Creates a dictionary to populate with the ids and rows
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
            
        # Creates a set to populate the prices
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
            
        # Creates a sorted list of rows by price
        self.rows_by_price = sorted(self.rows, key=lambda x: x[-1])
    
    def find_first_laptop_more_expensive(self, target_price: int) -> int:
        '''
        Uses binary search to find the index for which laptops are greater than target_price
        Any laptop at an index smaller than the above index is within target_price and can be recommended
        Returns -1 if all laptops are lower than the target_price
        '''
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1                   
        
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            
            if price > target_price:
                range_end = range_middle
            else:
                range_start = range_middle + 1
        
        if self.rows_by_price[range_start][-1] <= target_price:                  
            return -1                                   
        
        return range_start

inventory = Inventory('laptops.csv')
print(inventory.find_first_laptop_more_expensive(1000))
print(inventory.find_first_laptop_more_expensive(10000))

683
-1


### Laptops within a Budget Range

Now, we'll find laptops in the given range.

In [72]:
class Inventory(Inventory):
    
    def find_laptops_within_budget(self, min_price: int, max_price: int) -> tuple[int,int]:
        '''
        Uses find_first_laptop_more_expensive to get and returns a tuple of indices of laptops within budget 
        for list slicing
        '''
        
        min_index = self.find_first_laptop_more_expensive(min_price) - 1
        max_index = self.find_first_laptop_more_expensive(max_price)
        
        return min_index, max_index
    
inventory = Inventory('laptops.csv')
print(inventory.find_laptops_within_budget(1000,1500))

(682, 999)


## Cheapest Laptop with Certain Characteristics

Sometimes, a customer wants a laptop with some characteristics such as, for instance, 8GB of RAM and a 256GB hard drive. For simplicity, we will focus only on the amount of RAM and hard drive capacity.

In [77]:
class Inventory(Inventory):

    def __init__(self, csv_filename):
        
        # Open the file and generate the data
        with open(csv_filename) as f:
            reader = csv.reader(f)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            
        # Cast the price to an integer
        for row in self.rows:
            row[-1] = int(row[-1])
        
        # Creates a dictionary to populate with the ids and rows
        self.id_to_row = {}
        for row in self.rows:
            self.id_to_row[row[0]] = row
            
        # Creates a set to populate the prices
        self.prices = set()
        for row in self.rows:
            self.prices.add(row[-1])
            
        # Creates a sorted list of rows by price
        self.rows_by_price = sorted(self.rows, key=lambda x: x[-1])
        
        # Creates new columns for each row of RAM (index 7) and memory (index 8)
        self.header.append('int_ram')
        self.header.append('int_memory')
        for row in self.rows:
            nums = ''
            for char in row[7]:
                if char.isdigit():
                    nums += char
                else:
                    break
            row.append(int(nums))
            nums = ''
            for char in rows[8]:
                if char.isdigit():
                    nums += char
                else:
                    break
            row.append(int(nums))
        
        # Creates a sorted list of rows by ram then hard drive then price
        self.rows_by_ram_memory_price = sorted(self.rows, key = lambda x: (x[-2],x[-1]))
        
    def find_cheapest_laptop_by_ram_and_memory(self, ram: int, memory: int) -> list[any] | None:
        '''
        Using the sorted list rows_by_ram_memory_price, it finds the first match where the row's ram and then memory
        are equal or greater than arguments
        '''

        # Find ram
        for i, row in enumerate(self.rows_by_ram_memory_price):
            if row[-2] >= ram:
                if row[-1] >= memory:
                    return self.rows_by_ram_memory_price[i]
        return None
                
inventory = Inventory('laptops.csv')
print(inventory.find_cheapest_laptop_by_ram_and_memory(8,256))

['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', 1339, 8, 3824690]


## Conclusion

In this project, we created a class to represent the inventory of a laptop shop and discovered that preprocessing data allows for more effective queries.

We created queries for finding a laptop by id, price, and characteristics. Future work would be to add additional query functionality or optimize any queries further depending on further features.