# Q1:
You are developing a backend system for an application that processes videos uploaded by users to the server. On the server side, each instance of your program should predict what objects are in each frame, and return the result to another process within the application as a list of pairs of frame IDs and bounding box locations in JSON format. Each instance of your program has a 4GB RAM limit. The model that creates bounding boxes for objects exists in GPU RAM and so it does not consume general-purpose RAM. Your program must return only one JSON representing the results for the whole video, not partial results. The incoming video file for each upload can be up to 100GB. Please describe how you would feed the data into your model, and feed the resulting predictions into the JSON response.

# A1:
![alt text](diagram.png)

A simple strategy for handling very large media files:

1. The user submits a video to the platform.
2. The server can receive the data in one of two ways: a full upload from the client, or client-side streaming.
3. Next, we need to keep memory usage low and shorten processing time. The simplest method is to divide the data into equal-sized blocks, say 1 to 2 GB each, so that a single block fits within the 4 GB RAM limit of an instance. This resolves several issues at once:
   - If an error occurs, we can determine exactly which block it occurred in.
   - The system can keep track of which block contains which data.
4. Once the data is divided into blocks, we can distribute the blocks to the instances using Pub/Sub.
5. Using Kafka as the pub/sub layer lets us track each task, much like a task log, and notifies the system of which blocks have completed their work. If we know how many tasks there are and what their IDs are, we can glue the results together in the right order.
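
The block-splitting and ordered-merge steps can be sketched roughly as follows. This is a minimal illustration, not a full pipeline: `iter_blocks` and `merge_results` are hypothetical helper names, the Kafka transport and the GPU model are left out, and it assumes each block can be decoded into frames on its own.

```python
import json

def iter_blocks(path, block_size):
    """Yield (block_id, data) pairs, holding only one block in RAM at a time."""
    with open(path, "rb") as f:
        block_id = 0
        while True:
            data = f.read(block_size)
            if not data:
                break
            yield block_id, data
            block_id += 1

def merge_results(results_by_block):
    """Glue per-block predictions together in block-ID order and return one JSON string.

    results_by_block maps block_id -> list of predictions for that block.
    Blocks may have finished in any order; sorting the IDs restores the order.
    """
    merged = []
    for block_id in sorted(results_by_block):
        merged.extend(results_by_block[block_id])
    return json.dumps(merged)
```

In practice the predictions themselves are small compared with the video, so keeping them in memory until the final merge should be feasible; if not, each block's results could be written to disk and merged from there.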

# Q2:
You are multithreading a list of parallel tasks using a thread pool. What is t.join() used for? What is the purpose of using t.join() rather than skipping it? This program seems to work with or without the t.join(). Why do we still include it?

In [1]:
from threading import Thread
from queue import Queue
import time

def worker(args,q):
    time.sleep(1)
    print("done {}".format(args))
    q.put(1)
    return

workerList=[]
for i in range(3):
    q = Queue()
    t = Thread(target=worker,args=(i,q))
    t.start()
    workerList.append([q,t])

for i,workerPair in enumerate(workerList):
    workerPair[1].join()
    
print("ALL WORK DONE")

total=0
for i,workerPair in enumerate(workerList):
    total+=workerPair[0].get()
    
print("TOTAL={}".format(total))

done 1done 0
done 2

ALL WORK DONE
TOTAL=3


# A2:
1. What is t.join() used for?
t.join() waits for a thread to finish. It blocks the calling thread until the thread whose join() method is called terminates, either normally or through an unhandled exception, or until the optional timeout expires. This is helpful when you want to synchronise threads or make sure that all threads have finished their work before continuing.

There are a few reasons to call join() rather than skip it. First, if your application depends on a thread's results, join() guarantees the thread has finished producing them before you read them. Second, by making sure that every thread has completed its task before moving on, join() helps you avoid race conditions. Third, join() is a simple synchronisation point between threads.

The program above seems to work without join() only because q.get() itself blocks until each worker has put its result in the queue, and because Python waits for non-daemon threads before the process exits. Without join(), however, "ALL WORK DONE" would be printed before the workers have actually finished.

Overall, calling join() is generally considered best practice when working with multiple threads in Python. It helps ensure correctness and avoid race conditions while also allowing for easy synchronisation between different parts of your code.
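
A small sketch of the difference, using a plain shared list instead of a blocking Queue so that nothing else hides the missing wait. The names `worker`, `results`, `before_join` and `after_join` are made up for this illustration.

```python
from threading import Thread
import time

results = []

def worker(n):
    time.sleep(0.1)          # simulate some work
    results.append(n * n)    # publish the result

threads = [Thread(target=worker, args=(n,)) for n in range(3)]
for t in threads:
    t.start()

# Read immediately: the workers are most likely still sleeping,
# so this count is usually 0 and is not reliable.
before_join = len(results)

for t in threads:
    t.join()                 # block until each worker has terminated

# After join() every worker has finished, so all results are present.
after_join = len(results)
print(before_join, after_join)
```

Reading `results` before the join() loop is a race; reading it afterwards is safe.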


# Q3
What is faster, pd.concat([df1,df2,df3]) or a loop of df.append()? Explain your reasoning.

In [None]:
%%time
import random
import pandas as pd
numRows=10000

df = pd.DataFrame(columns=["age","gender"])
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    df=df.append(df2)
df.head()

Wall time: 8.55 s


|   | age | gender |
|---|-----|--------|
| 0 | 110 | M |
| 0 | 75 | M |
| 0 | 116 | F |
| 0 | 9 | M |
| 0 | 4 | F |


In [None]:
%%time
import random
import pandas as pd
numRows=10000
resultArr = []
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    resultArr.append(df2)

df=pd.concat(resultArr)
df.head()

Wall time: 3.52 s


|   | age | gender |
|---|-----|--------|
| 0 | 109 | F |
| 0 | 62 | M |
| 0 | 113 | M |
| 0 | 43 | M |
| 0 | 50 | M |


# A3:
1. Let's compare the speed of pd.concat([df1, df2, df3]) to a loop of df.append().

pd.concat is faster than a loop of df.append(), as the timings above show (3.52 s versus 8.55 s for 10,000 rows). The difference comes from copying. df.append() does not modify the DataFrame in place: each call returns a new DataFrame and copies every existing row into it, so a loop of n appends copies on the order of n² rows in total. Collecting the pieces in a list and calling pd.concat once copies each row only once. Note also that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the supported approach.
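
The copying cost can be illustrated with simple arithmetic, without pandas itself. This is a simplified model: it assumes every append rebuilds the whole frame and counts only rows copied, and the function names are hypothetical.

```python
def rows_copied_by_append_loop(n):
    """Each append builds a new frame containing all rows so far."""
    total = 0
    size = 0
    for _ in range(n):
        size += 1        # the frame grows by one row
        total += size    # and all of its rows are copied into the new frame
    return total

def rows_copied_by_single_concat(n):
    """One concat at the end copies each row exactly once."""
    return n

n = 10_000
print(rows_copied_by_append_loop(n))    # 50005000
print(rows_copied_by_single_concat(n))  # 10000
```

For the 10,000 rows used above, the append loop copies roughly 5,000 times as many rows as a single concat under this model.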

# Q4
Compare the purposes of 1,2, and 3:

1) Flask/Django/Others

2) apache2/nginx

3) gunicorn/other WSGI

# A4:

1. Compare Flask, Django and FastAPI

These are web frameworks: they are where the application code lives, and they provide routing, request handling and related tools.

Flask is a microframework that doesn't require much boilerplate code. It's easy to get started with and great for small projects. However, because it's so lightweight, it lacks some features that are found in other frameworks like Django (e.g., an ORM).

Django is a batteries-included framework that includes everything you need to build complex web applications. It has an ORM (object-relational mapper), which makes working with databases much easier than writing raw SQL queries. However, all of this functionality comes at the cost of more complexity - Django can be overwhelming for beginners compared to something like Flask.

FastAPI is relatively new, but it has already gained popularity due to its performance advantages over other frameworks, thanks to its use of asynchronous request handling. If your project needs top performance, then FastAPI would be worth considering despite its steep learning curve.
 
2. Compare apache2 and nginx
These are web servers: they face the internet, accept HTTP(S) connections, serve static files and can forward requests to the application server behind them.

Apache is one of the most widely used web servers worldwide. It has a long history and is very reliable. It is also highly configurable, making it simple to change the configuration of your server. However, Apache can be fairly resource-intensive, so it is not necessarily the best option for high-traffic websites.

Nginx is a more recent web server that is quickly gaining favour due to its efficiency and speed. It is well suited to busy websites because it consumes less memory than Apache. Nginx can also scale better as your site grows, since it can handle more concurrent connections than Apache.

3. Compare gunicorn and other WSGI
WSGI servers sit between the web server and the framework: they run the Python application and translate incoming HTTP requests into WSGI calls.

Gunicorn is a WSGI HTTP server for UNIX. It uses a pre-fork worker model ported from Ruby's Unicorn project. The Gunicorn server is easy to use, uses few server resources, and is fairly quick. It is also broadly compatible with various web frameworks.

Other popular WSGI servers include uWSGI and mod_wsgi (for Apache). Each has its own advantages and disadvantages. For instance, mod_wsgi integrates with Apache but may not perform as well as other solutions under high load, while uWSGI is very flexible but can be difficult to set up.
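
To make the division of labour concrete, here is a minimal sketch of the WSGI interface itself, using only the standard library. The callable `app` is what a framework such as Flask or Django ultimately exposes, and what a WSGI server such as Gunicorn calls for each request. `call_app` is a hypothetical helper that plays the server's role for demonstration purposes.

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    """A bare WSGI application: the part a framework normally provides."""
    body = b"hello from a WSGI app\n"
    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]

def call_app(path="/"):
    """Play the role of the WSGI server: build environ, call app, collect output."""
    environ = {}
    setup_testing_defaults(environ)   # fill in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body

status, body = call_app()
print(status, body)
```

In a typical deployment, nginx or Apache would receive the request from the client and pass it to the WSGI server, which in turn calls a callable like `app`.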


# Q5
You are writing a program that scrapes text from a long list of websites. How would you apply parallelism to speed up the scraping task?

# A5:
There are several ways to use parallelism in Python to speed up the scraping process. One option is the multiprocessing module, which lets you create multiple processes, each handling a share of the URL list. You can then call join() to wait for each process to complete before moving on with your code. Threading is an alternative: the threading module lets you create threads, which work like processes but share the same memory. Because scraping spends most of its time waiting on network responses rather than computing, threads are usually sufficient and are cheaper to start than processes. If the threads write to shared data, you can use the Lock object from this module to guarantee that only one thread at a time runs a specific block of code.
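
A minimal sketch of the threaded approach using the standard library's `concurrent.futures.ThreadPoolExecutor`. The function `fetch_text` is a hypothetical stand-in that only sleeps; a real scraper would perform an HTTP request and extract the text there. The URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_text(url):
    """Stand-in for a real download: sleep to imitate network latency."""
    time.sleep(0.05)
    return f"text from {url}"

def scrape_all(urls, max_workers=8):
    """Fetch all URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_text, urls))

urls = [f"https://example.com/page/{i}" for i in range(20)]
texts = scrape_all(urls)
print(len(texts), texts[0])
```

Leaving the `with` block waits for all worker threads, so no explicit join() call is needed here.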

# Q6
You are writing a python3 program. When should you use Docker and when should you use VENV?

# A6:
When choosing between Docker and VENV for a development environment, the following advantages and disadvantages can help you decide which is best for your project:

Docker: 
1. All dependencies are isolated from the rest of the system, so there is no risk of contaminating the system-wide Python installation.
2. You can quickly share your environment with others without worrying about compatibility issues.
3. Docker must be installed separately, so it may be trickier to set up initially than VENV.

VENV: 
1. It is built into the Python standard library, so it is simple to use and requires no additional setup.
2. Everyone on your team must have a compatible version of Python installed; otherwise, version incompatibilities can cause unexpected problems.

Which one should you use, then? In the end, it depends on the needs of your particular project. Use Docker if you require stronger isolation or want a simple way to share your environment with others. VENV may be the preferable option if simplicity is important and compatibility between colleagues' systems is not a concern.
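
As a small illustration of how little setup VENV needs, the standard library's `venv` module can create an environment programmatically. This is a sketch only: `create_env` is a hypothetical helper, the environment is created in a temporary directory and discarded, and `with_pip=False` is chosen here just to keep the example fast.

```python
import os
import tempfile
import venv

def create_env(parent_dir, name=".venv"):
    """Create a virtual environment under parent_dir and return its path."""
    env_dir = os.path.join(parent_dir, name)
    venv.create(env_dir, with_pip=False)
    return env_dir

with tempfile.TemporaryDirectory() as tmp:
    env_dir = create_env(tmp)
    # Every virtual environment contains a pyvenv.cfg file at its root.
    has_config = os.path.exists(os.path.join(env_dir, "pyvenv.cfg"))
    print("environment created:", has_config)
```

The environment reuses the Python interpreter that created it, which is why team members still need compatible Python versions installed.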