# What is a Generator and why use it?

In [None]:
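def csv_reader(file_name):
    # Open the file in a context manager so it is closed when the generator finishes.
    with open(file_name, "r") as f:
        for row in f:
            yield row  # hand back one line at a time instead of loading the whole file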
!wget -O data.csv https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
reader = csv_reader("data.csv")
print(next(reader))
print(next(reader))

A generator is a function that uses `yield` to produce its values lazily. It can be iterated one value at a time, so only the current item has to be held in memory, which helps when working with large amounts of data.
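A minimal sketch of the memory difference, using only the standard library. The helper names `squares_list` and `squares_gen` are illustrative and not part of the notebook. Exact byte sizes depend on the Python version, so the example only compares them.

```python
import sys

def squares_list(n):
    """Build every value up front and hold them all in memory."""
    return [i * i for i in range(n)]

def squares_gen(n):
    """Yield one value at a time; nothing is stored between steps."""
    for i in range(n):
        yield i * i

eager = squares_list(100_000)
lazy = squares_gen(100_000)

# The list object grows with n; the generator object stays small.
list_size = sys.getsizeof(eager)
gen_size = sys.getsizeof(lazy)
print(gen_size < list_size)

# Both produce the same values when consumed.
lazy_total = sum(lazy)
print(sum(eager) == lazy_total)
```

Note that a generator can be consumed only once: after `sum(lazy)` it is exhausted.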

# What is a Decorator and why use it?

In [1]:
def getName():
    return "Daniel"

def addTextToName(function):
    def wrapper():
        return "This is your name: {}".format(function())
    return wrapper

namePrinter = addTextToName(getName)

def wrapName(func):
    def wrapper():
        print("Your name is: {}".format(func()))
        print("Nice to meet you.")
    return wrapper

@wrapName
def getNameWrapped():
    return "Daniel"

print(getName())
print("=====")
print(namePrinter())
print("=====")
getNameWrapped()

Daniel
=====
This is your name: Daniel
=====
Your name is: Daniel
Nice to meet you.


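A decorator is a function that takes another function and returns a wrapped version of it. Decorators let you modify or add functionality to an existing function without changing its source code.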
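The wrappers in the cell above take no arguments. A slightly more general sketch, assuming a hypothetical decorator named `shout`, forwards any arguments and uses `functools.wraps` so the decorated function keeps its own name and docstring:

```python
import functools

def shout(func):
    """Decorator (hypothetical name): upper-case whatever func returns."""
    @functools.wraps(func)          # keep func's __name__ and docstring
    def wrapper(*args, **kwargs):   # forward any arguments unchanged
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    """Return a greeting for name."""
    return "hello, {}".format(name)

print(greet("Daniel"))
print(greet.__name__)
```

Without `functools.wraps`, `greet.__name__` would report `wrapper`, which makes debugging and introspection harder.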

# What is list/dict comprehension and why use it?

In [2]:
salaries = {'Anne': 50000, 'Bert': 60000, 'Carl': 70000, 'Dom': 80000}
raisedSalaries = {"key_"+k:round(v*1.13,0) for (k,v) in salaries.items()}
print(raisedSalaries)

{'key_Anne': 56500.0, 'key_Bert': 67800.0, 'key_Carl': 79100.0, 'key_Dom': 90400.0}


In [3]:
something = [1,2,3]
a = [i for i in something if i<2]
a

[1]

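A comprehension builds a new list or dict from an existing iterable in a single expression, optionally transforming and filtering the items. It is shorter, and usually easier to read, than the equivalent loop.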
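To make the "shorter syntax" point concrete, here is a small sketch comparing an explicit loop with the equivalent comprehension. The threshold of 60000 and the variable names are chosen only for illustration.

```python
salaries = {'Anne': 50000, 'Bert': 60000, 'Carl': 70000, 'Dom': 80000}

# Explicit loop: create, test, and insert by hand.
high_loop = {}
for name, pay in salaries.items():
    if pay >= 60000:
        high_loop[name] = pay

# Same result as a dict comprehension.
high_comp = {name: pay for name, pay in salaries.items() if pay >= 60000}

# List comprehension with a transform and a filter.
names = [name.upper() for name in salaries if name.startswith(('A', 'B'))]

print(high_loop == high_comp)
print(names)
```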

# You are multithreading a list of parallel tasks using a thread pool. What is t.join() used for? What is the purpose of using t.join() rather than skipping it. This program seems to work with or without the t.join(). Why do we still include it?

In [6]:
from threading import Thread
from queue import Queue
import time

def worker(args,q):
    time.sleep(1)
    print("done {}".format(args))
    q.put(1)
    return

workerList=[]
for i in range(3):
    q = Queue()
    t = Thread(target=worker,args=(i,q))
    t.start()
    workerList.append([q,t])

for i,workerPair in enumerate(workerList):
    workerPair[1].join()
   
print("ALL WORK DONE")

total=0
for i,workerPair in enumerate(workerList):
    total+=workerPair[0].get()
    
print("TOTAL={}".format(total))

done 1done 0done 2


ALL WORK DONE
TOTAL=3


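One way to think of join() is as a "hold" on the calling thread: the main thread blocks at t.join() until thread t has finished. The worker keeps running in its own thread; join() only makes the caller wait for it. This guarantees that the thread is complete before the main thread moves forward. It is fine if the thread has already finished before join() is called, because join() then returns immediately. The program above appears to work without join() because Queue.get() also blocks until an item is available, so the total is still correct, but "ALL WORK DONE" could be printed before the workers have actually finished. Keeping join() makes the ordering explicit and keeps the code correct if the results are later read in a non-blocking way.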
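A minimal sketch of what join() guarantees, without a queue to hide the effect. The worker `slow_worker` and the 0.2 s delay are assumptions made for the demonstration.

```python
import threading
import time

results = []

def slow_worker():
    """Hypothetical worker: pretend to do 0.2 s of work, then record a result."""
    time.sleep(0.2)
    results.append("finished")

t = threading.Thread(target=slow_worker)
t.start()

# The worker is most likely still sleeping here, so results is probably empty.
print("before join:", results)

t.join()  # block the main thread until slow_worker returns

# After join() the thread is guaranteed to have finished.
print("after join:", results, "alive:", t.is_alive())
```

The first print is timing-dependent, which is exactly the problem join() removes.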

# What is df.apply(myFunction, axis=1)?

In [7]:
import pandas as pd

def reverseName(row):
    text=row["name"]
    return "".join(reversed(list(text)))

df = pd.DataFrame(data={'name': ["Alice", "Bob"]})
df["reversed"]=df.apply(reverseName,axis=1)
display(df.head())

Unnamed: 0,name,reversed
0,Alice,ecilA
1,Bob,boB


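`df.apply(myFunction, axis=1)` calls the function once per row, passing each row as a Series, and collects the return values into a new Series. Here `reverseName` is applied to every row because `axis=1` means "along the columns", that is, row by row; with the default `axis=0` the function would receive each column instead.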

# What operation is df1.merge(df2, left_on='lkey', right_on='rkey') doing? What would this be called in SQL?

In [9]:
df1 = pd.DataFrame({'name': ['Brian', 'Bill', 'Frank'],
                    'demerits': [2, 3, 5]})
df2 = pd.DataFrame({'name': ['Brian', 'Bill', 'Frank'],
                    'convictions': [6, 7, 8]})
df1.merge(df2, how='outer')

Unnamed: 0,name,demerits,convictions
0,Brian,2,6
1,Bill,3,7
2,Frank,5,8


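`df1.merge(df2, left_on='lkey', right_on='rkey')` combines the rows of df1 and df2 whose `lkey` value in df1 equals the `rkey` value in df2. Because `how` defaults to `'inner'`, only keys present in both frames are kept, so in SQL this is an INNER JOIN (`... FROM df1 INNER JOIN df2 ON df1.lkey = df2.rkey`). The cell above instead merges on the shared `name` column with `how='outer'`, which corresponds to a FULL OUTER JOIN; the result looks the same there only because every name appears in both frames.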
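A short sketch of the `left_on`/`right_on` form from the question, assuming pandas is installed. The frames `left` and `right` and their values are made up so that the keys only partly overlap, which makes the difference between the default inner join and an outer join visible.

```python
import pandas as pd

left = pd.DataFrame({'lkey': ['a', 'b', 'c'], 'value_l': [1, 2, 3]})
right = pd.DataFrame({'rkey': ['b', 'c', 'd'], 'value_r': [20, 30, 40]})

# Default how='inner': keep only keys present on both sides.
inner = left.merge(right, left_on='lkey', right_on='rkey')

# how='outer': keep every key from either side, filling gaps with NaN.
outer = left.merge(right, left_on='lkey', right_on='rkey', how='outer')

print(inner)
print(outer)
```

The inner result has two rows (keys `b` and `c`); the outer result has four, with missing values where a key exists on one side only.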

# What is faster, pd.concat([df1,df2,df3]) or a loop of df.append()? Explain your reasoning.

In [10]:
%%time
import random
import pandas as pd
numRows=10000

df = pd.DataFrame(columns=["age","gender"])
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    df=df.append(df2)
df.head()

Wall time: 13.5 s


Unnamed: 0,age,gender
0,17,M
0,117,F
0,4,F
0,53,F
0,116,F


In [11]:
%%time
import random
numRows=10000
resultArr = []
for _ in range(numRows):
    df2=pd.DataFrame(data={'age': [random.randint(0,120)], 'gender': [random.choice(["M","F"])]})
    resultArr.append(df2)

df=pd.concat(resultArr)
df.head()

Wall time: 5.69 s


Unnamed: 0,age,gender
0,47,M
0,20,M
0,71,M
0,95,F
0,67,M


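`pd.concat` is faster. `DataFrame.append` does not modify the original object; each call builds a new DataFrame and copies all existing rows into it, so appending in a loop copies the data over and over and the total cost grows roughly quadratically with the number of rows. Collecting the pieces in a Python list and calling `pd.concat` once copies the data a single time. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the first cell only runs on older versions.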
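The same idea can be pushed one step further. The sketch below, assuming pandas is installed, collects plain dicts and builds the DataFrame once; it is an alternative pattern, not what the cells above measure, and the names `rows`, `df_rows`, and `df_concat` are illustrative.

```python
import random
import pandas as pd

num_rows = 1000

# Collect plain Python dicts first; appending to a list is cheap.
rows = []
for _ in range(num_rows):
    rows.append({'age': random.randint(0, 120),
                 'gender': random.choice(["M", "F"])})

# Build the DataFrame once, so the data is copied a single time.
df_rows = pd.DataFrame(rows)

# Equivalent with pd.concat: one call over a list of small frames.
frames = [pd.DataFrame([row]) for row in rows]
df_concat = pd.concat(frames, ignore_index=True)

print(len(df_rows), len(df_concat))
```

Passing `ignore_index=True` gives the result a fresh 0..n-1 index instead of the repeated `0` labels seen in the outputs above.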

# What is the purpose of tools such as Flask and Django?

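Flask and Django are open-source web frameworks for Python. They take care of the common parts of a web application, such as routing URLs to functions and building HTTP responses, so that you only write the application logic. Django is a full-featured framework best suited to large and complex web applications, while Flask is a lightweight, extensible framework that is well suited to small web applications and services.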

# Compare the purposes of 1,2, and 3:

1) Flask/Django/Others

2) apache2/nginx

3) gunicorn/other WSGI

1. Flask and Django are web frameworks used in Python to develop web applications. They contain the application code itself.
2. Web servers such as Nginx and Apache accept HTTP connections from clients, serve static files directly, and can act as a reverse proxy that forwards dynamic requests to an application server.
3. Gunicorn is an application server that implements the Web Server Gateway Interface (WSGI), the standard Python interface between web servers and web applications. It turns each HTTP request into a call to the Python application and sends the returned response back.
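To show what the WSGI interface in point 3 looks like, here is a minimal sketch using only the standard library. The callable name `application` and the `/ping` path are assumptions for the example; `fake_start_response` stands in for the callback a real server such as gunicorn would supply.

```python
from wsgiref.util import setup_testing_defaults

def application(environ, start_response):
    """A bare WSGI app: the callable a WSGI server would invoke per request."""
    body = "Hello from {}".format(environ["PATH_INFO"]).encode("utf-8")
    headers = [("Content-Type", "text/plain; charset=utf-8"),
               ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]

# Simulate what the WSGI server does for one request, without opening a socket.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers

environ = {}
setup_testing_defaults(environ)   # fills in a minimal, valid WSGI environ
environ["PATH_INFO"] = "/ping"
response = b"".join(application(environ, fake_start_response))
print(captured["status"], response)
```

Frameworks such as Flask and Django expose a callable of this shape, which is what lets the same application run under different WSGI servers.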

# You are writing a program that scrapes text from a long list of websites. How would you apply parallelism to speed up the scraping task?

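Scraping is mostly I/O-bound: the program spends most of its time waiting for responses. We can therefore use a thread pool or the asyncio library to keep many requests to different websites (or to different pages of the same website) in flight at once. The multiprocessing library helps mainly when parsing the downloaded pages is CPU-heavy.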
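A minimal sketch of the thread-pool approach with `concurrent.futures`. To keep it runnable without network access, `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    """Stand-in for a real HTTP request (hypothetical): sleep to mimic network latency."""
    time.sleep(0.1)
    return "<html>content of {}</html>".format(url)

urls = ["https://example.com/page{}".format(i) for i in range(10)]

start = time.perf_counter()
# Threads overlap the waiting time, so the requests run concurrently rather than one by one.
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))   # results come back in input order
elapsed = time.perf_counter() - start

print(len(pages), "pages in {:.2f} s".format(elapsed))
```

In a real scraper, `fetch` would call an HTTP client, and `max_workers` should be kept modest so the target sites are not overloaded.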

# You are writing a python3 program. When should you use Docker and when should you use VENV?

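A virtualenv isolates only Python dependencies. A Docker container packages the whole userland (system libraries, tools, and the Python interpreter) while sharing the host's kernel. With a Python virtualenv, you can easily switch between sets of Python packages, and between Python versions if those interpreters are installed, but you're stuck with your host OS and its system libraries. With a Docker image, you can swap out the base distribution: install and run Python on Ubuntu, Debian, Alpine, or, on a Windows host, Windows Server Core. As a rule of thumb, use a venv when only Python packages need isolating, and Docker when the program also depends on system packages or must run identically on other machines.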

# What is requirements.txt used for?

It lists the packages a project depends on, usually with pinned versions, so that the same environment can be recreated elsewhere with `pip install -r requirements.txt`.

# Why not use `git add .`

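It stages everything under the current directory that is not ignored, so it can also add generated files, backups, and configuration files containing things you don't want committed, such as credentials. Adding files by name, or reviewing `git status` first, avoids this.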

# Compare matrix multiplication using the GPU, x86 CPU, and x86 vector coprocessor such as AVX2/SSE3. Why do these different kinds of hardware all exist in our personal computers?

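Generally speaking, GPUs are much faster than CPUs at simple, highly parallel tasks such as multiplying large matrices, but GPU computation has costs:

1. Transferring data between main RAM and graphics RAM takes time.
2. Loading and starting a GPU program takes time, so while the multiplication itself may be 100 (or more) times faster, the actual speedup can be much smaller, or even turn into a slowdown.
3. GPUs perform poorly on some workloads that CPUs handle well; branching code, for example, can cause massive slowdowns.

Vector extensions such as SSE3 and AVX2 sit in between: they are part of the x86 CPU itself and let a single instruction operate on several values at once, which speeds up matrix arithmetic without any transfer to a separate device. These kinds of hardware coexist because each suits a different workload: the CPU for general, branching code, its vector units for moderate data parallelism, and the GPU for very large parallel jobs.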

# Compare and contrast Ubuntu and RHEL.

Ubuntu is a Linux-based operating system that belongs to the Debian family and uses APT with .deb packages. Red Hat Enterprise Linux (RHEL) is a Linux-based operating system designed for businesses; it is sold with a commercial support subscription and uses RPM packages. Ubuntu is used on desktops and servers. RHEL can be used on desktops, on servers, in hypervisors, or in the cloud.