# Building Fast Queries on a CSV
---
We will imagine that we own an online laptop store and want to build a way to answer a few different business questions about our inventory.

We will use the `laptops.csv` file as our inventory. This CSV file was adapted from the [Laptop Prices dataset on Kaggle](https://www.kaggle.com/ionaskel/laptop-prices). We changed the IDs and made the prices integers.

Here is a brief description of the columns:

- **ID**: A unique identifier for the laptop.
- **Company**: The name of the company that produces the laptop.
- **Product**: The name of the laptop.
- **TypeName**: The type of laptop.
- **Inches**: The size of the screen in inches.
- **ScreenResolution**: The resolution of the screen.
- **CPU**: The laptop CPU.
- **RAM**: The amount of RAM in the laptop.
- **Memory**: The size of the hard drive.
- **GPU**: The graphics card name.
- **OpSys**: The name of the operating system.
- **Weight**: The laptop weight.
- **Price**: The price of the laptop.

The goal of this guided project is to create a class that represents our inventory. The methods in that class will implement the queries that we want to answer about our inventory. We will also preprocess that data to make those queries run faster.

Here are some queries that we will want to answer:

- Given a laptop id, find the corresponding data.
- Given an amount of money, find whether there are two laptops whose total price is that given amount.
- Identify all laptops whose price falls within a given budget.

## 1. Prerequisites

In [12]:
# Importing the libraries
import csv
import time                                                     
import random 

# Importing the data
with open('laptops.csv') as dat:
    reader = csv.reader(dat)
    rows = list(reader)
    header = rows[0]
    rows = rows[1:]

In [2]:
# Previewing the header
print(header,'\n')

# Previewing the first three rows
for row in rows[:3]: print(row)

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price'] 

['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339']
['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898']
['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575']


## 2. Inventory Class

### 2.1. Creating the Class Object
We will now create the class that we need to represent our dataset, which is the laptop inventory.

In [5]:
# Creating the class
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as dat:
            reader = csv.reader(dat)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]

# Trying out the class
test = Inventory('laptops.csv')

# Printing the length of the rows excluding the header
print('There are {} rows'.format(len(test.rows)))

There are 1303 rows


### 2.2. Creating a Function to Look Up a Laptop
The first thing that we will implement is a way to look up a laptop from a given identifier. In this way, when a customer comes to our store with a purchase slip, we can quickly identify the laptop to which it corresponds.

For this, we will write a function named `get_laptop_from_id()`. This function will take as argument the identifier of the laptop and return the full row of the laptop with that id.

In [51]:
# Adding the get_laptop_from_id function
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as dat:
            reader = csv.reader(dat)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
    
    def get_laptop_from_id(self, laptop_id):
        for rows in self.rows:
            if rows[0] == laptop_id:
                return rows
        return None

# Testing the new function
print(Inventory('laptops.csv').get_laptop_from_id('3362737'))
print(Inventory('laptops.csv').get_laptop_from_id('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575']
None


The algorithm we have written is adequate for this dataset, which contains only about 1,300 rows, but we can improve it by preprocessing the dataset. Its time complexity is *O(R)*, where *R* is the number of rows, because in the worst case it scans every row.

The idea is to preprocess the data into a dictionary where the keys are the IDs and the values are the rows. Then, we will use that dictionary in the `get_laptop_from_id()` method. We can do this in two steps:

1. Preprocess the data and create the dictionary in the `__init__()` method.
2. Re-implement the `get_laptop_from_id()` method. We will do it as a new method so that we can compare the two.

In [52]:
# Adding preprocessing
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as dat:
            reader = csv.reader(dat)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            self.id_to_row = dict()
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row 
    
    def get_laptop_from_id(self, laptop_id):
        for rows in self.rows:
            if rows[0] == laptop_id:
                return rows
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None

# Testing the new function
print(Inventory('laptops.csv').get_laptop_from_id_fast('3362737'))
print(Inventory('laptops.csv').get_laptop_from_id_fast('3362736'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]
None


The new function (`get_laptop_from_id_fast()`) has constant time complexity, *O(1)*, on average, because a dictionary lookup does not depend on the number of rows. The drawback is higher space complexity, since we are creating a new dictionary that stores *R* entries.
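
As a side note, the membership test followed by indexing can be collapsed into a single `dict.get()` call, which returns `None` for a missing key. The following is a minimal sketch on a hypothetical two-row dictionary (the shortened rows and the `lookup()` helper are made up for illustration and are not part of the `Inventory` class):

```python
# Hypothetical, shortened rows that mirror the id -> row mapping built in __init__()
id_to_row = {
    '6571244': ['6571244', 'Apple', 'MacBook Pro', 1339],
    '3362737': ['3362737', 'HP', '250 G6', 575],
}

def lookup(laptop_id):
    # dict.get() returns None when the key is absent, so no explicit "in" check is needed
    return id_to_row.get(laptop_id)

found = lookup('3362737')
missing = lookup('3362736')
print(found)
print(missing)
```

The behavior should match `get_laptop_from_id_fast()`: a row for a known ID and `None` otherwise.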

We will now test the run time for both functions using random IDs which will be generated using the `random.randint()` function from the `random` library. We will also measure the time taken by using the `time.time()` function from the `time` library.
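
For reference, `time.perf_counter()` is a clock intended for measuring short durations, so it can be a reasonable alternative to `time.time()` for this kind of comparison. The sketch below assumes a synthetic inventory of 1,000 made-up `[id, price]` rows rather than `laptops.csv`, and the helper names (`linear_lookup`, `dict_lookup`, `total_time`) are hypothetical:

```python
import random
import time

# Synthetic stand-in for the inventory: 1,000 made-up rows of [id, price]
random.seed(0)
rows = [[str(i), random.randint(100, 5000)] for i in range(1000000, 1001000)]
id_to_row = {row[0]: row for row in rows}

def linear_lookup(laptop_id):
    # O(R): scans the rows until the ID matches
    for row in rows:
        if row[0] == laptop_id:
            return row
    return None

def dict_lookup(laptop_id):
    # O(1) on average: a single dictionary lookup
    return id_to_row.get(laptop_id)

def total_time(func, ids):
    # Time the whole batch once instead of summing many tiny intervals
    start = time.perf_counter()
    for laptop_id in ids:
        func(laptop_id)
    return time.perf_counter() - start

# About half of these random IDs exist in the synthetic inventory
ids = [str(random.randint(1000000, 1001999)) for _ in range(2000)]
slow = total_time(linear_lookup, ids)
fast = total_time(dict_lookup, ids)
print('linear scan: {:.4f} s, dictionary: {:.4f} s'.format(slow, fast))
```

The absolute numbers depend on the machine, so only the relative difference between the two is meaningful.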

In [53]:
# Generating random IDs
ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]

# Assigning the class into a variable
inventory = Inventory('laptops.csv')

# Testing the duration for the "slow" function
total_time_no_dict = 0
for i in ids:
    start = time.time()
    inventory.get_laptop_from_id(i)
    end = time.time()
    total_time_no_dict += end - start
    
# Testing the duration for the fast function
total_time_dict = 0
for i in ids:
    start = time.time()
    inventory.get_laptop_from_id_fast(i)
    end = time.time()
    total_time_dict += end - start
    
# Printing the result
print('The time taken for the function without dictionary is {} seconds.'.format(total_time_no_dict))
print('The time taken for the function with dictionary is {} seconds.'.format(total_time_dict))

The time taken for the function without dictionary is 1.1074535846710205 seconds.
The time taken for the function with dictionary is 0.005177974700927734 seconds.


The results confirm our suspicion: the function that uses the dictionary (preprocessing) is roughly 200 times faster, although the exact ratio will vary slightly every time we run it.

### 2.3. Creating a Look Up for Promotion
Sometimes, your store offers a promotion where you give a gift card. A customer can use the gift card to buy up to two laptops. To avoid having to keep track of what was already spent, the gift card can be used only once. This means that, even if there is leftover money, it cannot be used anymore.

You don't want to make a customer feel cheated, so whenever you issue a gift card, you want to make sure that there is at least one way to spend it in full. In other words, before issuing a gift card for D dollars, you want to make sure that either there is a laptop that costs exactly D dollars or two laptops whose costs add up to precisely D dollars.

We will write a function that checks whether it is possible to spend precisely that amount by purchasing up to two laptops.

In [54]:
# Adding promotion lookup
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as dat:
            reader = csv.reader(dat)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            self.id_to_row = dict()
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row
    
    def get_laptop_from_id(self, laptop_id):
        for rows in self.rows:
            if rows[0] == laptop_id:
                return rows
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if int(row[-1]) == dollars:
                return True
        
        for i in range(len(self.rows)):
            for j in range(i, len(self.rows)):
                if int(self.rows[i][-1] + self.rows[j][-1]) == dollars:
                    return True
        return False

# Testing the new function
inventory = Inventory('laptops.csv')
for dol in(1000, 442):
    if inventory.check_promotion_dollars(dol):
        print('You can get a laptop or two for ${}.'.format(dol))
    else:
        print('You cannot get a laptop or two for ${}.'.format(dol))

You can get a laptop or two for $1000.
You cannot get a laptop or two for $442.


We have already seen how preprocessing can speed up this kind of query, so let's apply it to `check_promotion_dollars()` to make our code run faster.

We only care about whether a solution exists, not which laptops form it, so we can store all laptop prices in a set when we initialize the inventory. Then we can check in constant time whether there is a laptop with a given price.
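
To illustrate the idea on its own, here is a small sketch that uses a hypothetical set of three prices. The function name `can_spend_exactly()` is made up for this example. For every price `p` in the set, it checks whether the complement `dollars - p` is also in the set:

```python
def can_spend_exactly(prices, dollars):
    # prices is a set of distinct laptop prices
    if dollars in prices:
        return True
    # A second laptop completes the amount only if it costs dollars - p
    return any(dollars - p in prices for p in prices)

sample_prices = {575, 898, 1339}  # hypothetical prices

print(can_spend_exactly(sample_prices, 898))   # a single laptop
print(can_spend_exactly(sample_prices, 1473))  # 575 + 898
print(can_spend_exactly(sample_prices, 1150))  # 575 + 575
print(can_spend_exactly(sample_prices, 1000))  # no combination
```

Because a set only records that a price exists, this sketch treats buying two laptops at the same price (575 + 575) as valid, which matches the nested-loop version, where the inner loop starts at `i`.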

In [55]:
# Adding promotion lookup with preprocessing
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as dat:
            reader = csv.reader(dat)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            self.id_to_row = dict()
            self.prices = set()
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row 
                self.prices.add(int(row[-1]))
    
    def get_laptop_from_id(self, laptop_id):
        for rows in self.rows:
            if rows[0] == laptop_id:
                return rows
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if int(row[-1]) == dollars:
                return True
        
        for i in range(len(self.rows)):
            for j in range(i, len(self.rows)):
                if int(self.rows[i][-1] + self.rows[j][-1]) == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False
    
# Testing the new function
inventory = Inventory('laptops.csv')
for dol in(1000, 442):
    if inventory.check_promotion_dollars_fast(dol):
        print('You can get a laptop or two for ${}.'.format(dol))
    else:
        print('You cannot get a laptop or two for ${}.'.format(dol))

You can get a laptop or two for $1000.
You cannot get a laptop or two for $442.


We will now compare the performance of both functions, just as we did for the lookup functions above.

In [43]:
# Generating random prices
prices = [random.randint(100, 5000) for _ in range(100)]

# Assigning the class into a variable
inventory = Inventory('laptops.csv')

# Testing the duration for the "slow" function
total_time_no_set = 0
for i in prices:
    start = time.time()
    inventory.check_promotion_dollars(i)
    end = time.time()
    total_time_no_set += end - start
    
# Testing the duration for the fast function
total_time_set = 0
for i in prices:
    start = time.time()
    inventory.check_promotion_dollars_fast(i)
    end = time.time()
    total_time_set += end - start
    
# Printing the result
print('The time taken for the function without set is {} seconds.'.format(total_time_no_set))
print('The time taken for the function with set is {} seconds.'.format(total_time_set))

The time taken for the function without set is 46.085529088974 seconds.
The time taken for the function with set is 0.0007078647613525391 seconds.


We can see that the function with the set is roughly 65,000 times faster than the nested-loop function, so preprocessing has given us a substantial improvement.

### 2.4. Finding Laptops Within a Budget
We want to write a method that efficiently answers the query: Given a budget of D dollars, find all laptops whose price is at most D.

If we sort all laptops by price, we can use binary search to identify the first laptop in the sorted list with a price larger than D. We need to make sure that our binary search finds the first one on the list. Then, the result of the query will consist of all laptops whose index in the sorted list is smaller than the index of the first laptop whose price is higher than D dollars.

We can use the `sorted()` function to help us sort the dataset based on the price before we conduct the binary search.
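
As a cross-check for a hand-written binary search, the standard library's `bisect.bisect_right()` returns the position just past the last entry that is less than or equal to the target, which is exactly the index of the first more expensive item. The sketch below assumes a hypothetical sorted list of bare prices instead of full rows, and the function names are made up for illustration:

```python
import bisect

sorted_prices = [575, 898, 898, 1339, 2199]  # hypothetical prices, already sorted

def first_more_expensive(prices, target_price):
    # Index of the first price strictly greater than target_price, or -1 if none
    index = bisect.bisect_right(prices, target_price)
    return index if index < len(prices) else -1

def within_budget(prices, budget):
    # Everything before the insertion point costs at most the budget
    return prices[:bisect.bisect_right(prices, budget)]

print(first_more_expensive(sorted_prices, 898))   # index of 1339
print(first_more_expensive(sorted_prices, 5000))  # nothing is more expensive
print(within_budget(sorted_prices, 898))
```

Note that slicing with the insertion point handles the "nothing is more expensive" case naturally, since the slice then covers the whole list.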

In [57]:
# Defining a function to extract the price
def row_price(row):
    return row[-1]

# Adding the binary search
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as dat:
            reader = csv.reader(dat)
            rows = list(reader)
            self.header = rows[0]
            self.rows = rows[1:]
            self.id_to_row = dict()
            self.prices = set()
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row 
                self.prices.add(int(row[-1]))
            self.rows_by_price = sorted(self.rows, key = row_price)
    
    def get_laptop_from_id(self, laptop_id):
        for rows in self.rows:
            if rows[0] == laptop_id:
                return rows
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        return None
    
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if int(row[-1]) == dollars:
                return True
        
        for i in range(len(self.rows)):
            for j in range(i, len(self.rows)):
                if int(self.rows[i][-1] + self.rows[j][-1]) == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            if dollars - price in self.prices:
                return True
        return False
    
    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0                                   
        range_end = len(self.rows_by_price) - 1                   
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2  
            price = self.rows_by_price[range_middle][-1]
            if price > target_price:
                range_end = range_middle
            else:
                range_start = range_middle + 1
        if self.rows_by_price[range_start][-1] <= target_price:                  
            return -1                                      
        return range_start

# Testing the new function
inventory = Inventory('laptops.csv')
for dol in (1000, 10000):
    temp = inventory.find_first_laptop_more_expensive(dol)
    if temp != -1:
        # temp is an index into rows_by_price, not a laptop ID
        print('The first laptop more expensive than ${} is at index {}.'.format(dol, temp))
    else:
        # -1 means every laptop costs at most the given amount
        print('No laptop is more expensive than ${}.'.format(dol))

The first laptop more expensive than $1000 is at index 683.
No laptop is more expensive than $10000.
