# Software Engineering Practices Quiz

## Quiz 1 - Refactor: Wine Quality Analysis
In this exercise, you'll refactor code that analyzes a wine quality dataset taken from the UCI Machine Learning Repository [here](https://archive.ics.uci.edu/ml/datasets/wine+quality). 

Each row contains data on a wine sample, including several physicochemical properties gathered from tests, as well as a quality rating evaluated by wine experts.

The code in this notebook first renames the columns of the dataset and then calculates some statistics on how some features may be related to quality ratings. Can you refactor this code to make it cleaner and more modular?

In [1]:
import pandas as pd
df = pd.read_csv('./data/winequality-red.csv', sep=';')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


### Renaming Columns
You want to replace the spaces in the column labels with underscores to be able to reference columns with dot notation. Here's one way you could've done it.

In [2]:
new_df = df.rename(columns={'fixed acidity': 'fixed_acidity',
                             'volatile acidity': 'volatile_acidity',
                             'citric acid': 'citric_acid',
                             'residual sugar': 'residual_sugar',
                             'free sulfur dioxide': 'free_sulfur_dioxide',
                             'total sulfur dioxide': 'total_sulfur_dioxide'
                            })
new_df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


And here's a slightly better way you could do it. It avoids the naming errors that come from typing each new label by hand. However, it still looks a little repetitive. Can you make it better?

In [3]:
df_1 = df.copy()

labels = list(df.columns)
labels[0] = labels[0].replace(' ', '_')
labels[1] = labels[1].replace(' ', '_')
labels[2] = labels[2].replace(' ', '_')
labels[3] = labels[3].replace(' ', '_')
labels[5] = labels[5].replace(' ', '_')
labels[6] = labels[6].replace(' ', '_')
df_1.columns = labels

df_1.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


### Renaming Columns - SOLUTION

In [4]:
# Using for loop
label_list = list(df.columns)
for i in range(len(label_list)):
    label_list[i] = label_list[i].replace(' ', '_')

df.columns = label_list
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
# Using list comprehension
df.columns = [label.replace(' ', '_') for label in df.columns]
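
As a further variation that is not part of the original solution, pandas can also do the replacement for you. The sketch below uses a small made-up frame (`sample` and its values are assumptions for illustration) and shows two options: the `.str.replace` accessor on the column index, and `rename` with a function.

```python
import pandas as pd

# Tiny stand-in frame with the same kind of labels as the wine dataset
# (the values are made up for illustration).
sample = pd.DataFrame({
    'fixed acidity': [7.4, 7.8],
    'residual sugar': [1.9, 2.6],
    'pH': [3.51, 3.20],
})

# Option A: vectorized string replacement on the column Index
renamed_a = sample.copy()
renamed_a.columns = renamed_a.columns.str.replace(' ', '_', regex=False)

# Option B: rename with a function; returns a new DataFrame
# and leaves `sample` untouched
renamed_b = sample.rename(columns=lambda label: label.replace(' ', '_'))

print(list(renamed_a.columns))
print(list(renamed_b.columns))
```

Option B is handy when you want to keep the original DataFrame unchanged, as the earlier cells do with `df.copy()`.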

### Analyzing Features
Now that your columns are ready, you want to see how different features of this dataset relate to the quality rating of the wine. A very simple way you could do this is by observing the mean quality rating for the top and bottom half of each feature. The code below does this for four features. It looks pretty repetitive right now. Can you make this more concise? 

You might challenge yourself to figure out how to make this code more efficient! But you don't need to worry too much about efficiency right now - we will cover that more in the next section.

In [6]:
median_alcohol = df_1.alcohol.median()
for i, alcohol in enumerate(df_1.alcohol):
    if alcohol >= median_alcohol:
        df_1.loc[i, 'alcohol'] = 'high'
    else:
        df_1.loc[i, 'alcohol'] = 'low'
df_1.groupby('alcohol').quality.mean()

alcohol
high    5.958904
low     5.310302
Name: quality, dtype: float64

In [7]:
median_pH = df_1.pH.median()
for i, pH in enumerate(df_1.pH):
    if pH >= median_pH:
        df_1.loc[i, 'pH'] = 'high'
    else:
        df_1.loc[i, 'pH'] = 'low'
df_1.groupby('pH').quality.mean()

pH
high    5.598039
low     5.675607
Name: quality, dtype: float64

In [8]:
median_sugar = df_1.residual_sugar.median()
for i, sugar in enumerate(df_1.residual_sugar):
    if sugar >= median_sugar:
        df_1.loc[i, 'residual_sugar'] = 'high'
    else:
        df_1.loc[i, 'residual_sugar'] = 'low'
df_1.groupby('residual_sugar').quality.mean()

residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64

In [9]:
median_citric_acid = df_1.citric_acid.median()
for i, citric_acid in enumerate(df_1.citric_acid):
    if citric_acid >= median_citric_acid:
        df_1.loc[i, 'citric_acid'] = 'high'
    else:
        df_1.loc[i, 'citric_acid'] = 'low'
df_1.groupby('citric_acid').quality.mean()

citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64

### Analyzing Features - SOLUTION

In [10]:
def compare_quality(df, col_name):
    '''
    Compare mean quality rating for the top and bottom half of a feature.
    
    INPUT
    df - the DataFrame to analyze; it is modified in place
    col_name - name of the column to split at its median
    
    OUTPUT
    None (the values in df[col_name] are replaced with 'high' or 'low')
    '''
    median_col = df[col_name].median()
    for i, col in enumerate(df[col_name]):
        if col >= median_col:
            df.loc[i, col_name] = 'high'
        else:
            df.loc[i, col_name] = 'low'

In [11]:
for feature in df.columns[:-1]:
    compare_quality(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')

fixed_acidity
high    5.726061
low     5.540052
Name: quality, dtype: float64 

volatile_acidity
high    5.392157
low     5.890166
Name: quality, dtype: float64 

citric_acid
high    5.822360
low     5.447103
Name: quality, dtype: float64 

residual_sugar
high    5.665880
low     5.602394
Name: quality, dtype: float64 

chlorides
high    5.507194
low     5.776471
Name: quality, dtype: float64 

free_sulfur_dioxide
high    5.595268
low     5.677136
Name: quality, dtype: float64 

total_sulfur_dioxide
high    5.522981
low     5.750630
Name: quality, dtype: float64 

density
high    5.540574
low     5.731830
Name: quality, dtype: float64 

pH
high    5.598039
low     5.675607
Name: quality, dtype: float64 

sulphates
high    5.898917
low     5.351562
Name: quality, dtype: float64 

alcohol
high    5.958904
low     5.310302
Name: quality, dtype: float64 
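
The solution above still loops over every row and overwrites the feature column. As a possible next step toward the efficiency challenge, here is a minimal sketch that labels all rows at once with `np.where` and leaves the DataFrame unchanged. The function name `mean_quality_by_half` and the `toy` data are assumptions made up for this illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def mean_quality_by_half(df, col_name, target='quality'):
    """Return the mean of `target` for rows at or above vs. below the
    median of `col_name`. Unlike the loop version, `df` is not modified.
    """
    median = df[col_name].median()
    # np.where labels every row in one step instead of looping row by row
    half = pd.Series(np.where(df[col_name] >= median, 'high', 'low'),
                     index=df.index, name=col_name)
    return df.groupby(half)[target].mean()

# Made-up data for illustration only
toy = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0],
    'quality': [5, 5, 6, 7],
})

result = mean_quality_by_half(toy, 'alcohol')
print(result)
```

Because the input is left untouched, the function can be called repeatedly on the same DataFrame, for example once per feature in a loop like the one in cell 11.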



## Quiz 2 - Optimizing Code: Common Books
Here's the code your coworker wrote to find the common book ids in `books_published_last_two_years.txt` and `all_coding_books.txt` to obtain a list of recent coding books.

In [12]:
import time
import pandas as pd
import numpy as np

In [13]:
with open('./data/books_published_last_two_years.txt') as f:
    recent_books = f.read().split('\n')
    
with open('./data/all_coding_books.txt') as f:
    coding_books = f.read().split('\n')

In [14]:
start = time.time()
recent_coding_books = []

for book in recent_books:
    if book in coding_books:
        recent_coding_books.append(book)

print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 10.020211935043335 seconds


### Tip #1: Use vector operations over loops when possible

Use NumPy's `intersect1d` function to get the intersection of the `recent_books` and `coding_books` lists.

In [15]:
start = time.time()
recent_coding_books = np.intersect1d(recent_books, coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.02195143699645996 seconds
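
To see what `np.intersect1d` returns, here is a small sketch on made-up ids (the lists `recent` and `coding` are assumptions for illustration). The result is a sorted array of the unique values found in both inputs, so duplicates in either list appear only once.

```python
import numpy as np

recent = ['b12', 'a07', 'c33', 'a07']   # made-up ids, with one duplicate
coding = ['a07', 'c33', 'd41']

# Sorted, unique values that appear in both inputs
common = np.intersect1d(recent, coding)
print(common)
```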


### Tip #2: Know your data structures and which methods are faster
Use the set's `intersection` method to get the common elements in `recent_books` and `coding_books`.

In [16]:
start = time.time()
recent_coding_books = set(recent_books).intersection(coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

96
Duration: 0.004956245422363281 seconds
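
One way to see why the data structure matters: checking `x in some_list` scans the list element by element, while `x in some_set` is a hash lookup that takes roughly constant time on average. The sketch below compares the two on synthetic ids invented for illustration (it does not read the real files), and times them with `time.perf_counter`, which is an assumption swapped in here for short measurements.

```python
import time

# Synthetic ids, invented for illustration; the real files are not used here.
coding_ids = [str(n) for n in range(0, 20000, 2)]   # 10,000 even ids
recent_ids = [str(n) for n in range(0, 3000, 3)]    # 1,000 multiples of three

# List membership: each `in` scans the list from the start (linear time).
start = time.perf_counter()
common_list = [book for book in recent_ids if book in coding_ids]
list_seconds = time.perf_counter() - start

# Set membership: each `in` is a hash lookup (constant time on average).
coding_set = set(coding_ids)
start = time.perf_counter()
common_set = [book for book in recent_ids if book in coding_set]
set_seconds = time.perf_counter() - start

print(len(common_list), len(common_set))
print('list: {:.4f} s, set: {:.4f} s'.format(list_seconds, set_seconds))
```

Both approaches find the same ids; only the lookup cost differs, and the gap grows with the size of the list being searched.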


## Quiz 3 - Optimizing Code: Holiday Gifts
In the last example, you learned that using vectorized operations and more efficient data structures can optimize your code. Let's use these tips for one more example.

Say your online gift store has one million users that each listed a gift on a wish list. You have the prices for each of these gifts stored in `gift_costs.txt`. For the holidays, you're going to give each customer their wish list gift for free if it is under 25 dollars. Now, you want to calculate the total cost of all gifts under 25 dollars to see how much you'd spend on free gifts. Here's one way you could've done it.

In [17]:
with open('./data/gift_costs.txt') as f:
    gift_costs = f.read().split('\n')

gift_costs = np.array(gift_costs).astype(int)  # convert string to int

In [18]:
start = time.time()

total_price = 0
for cost in gift_costs:
    if cost < 25:
        total_price += cost * 1.08  # add cost after tax

print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

32765421.23999867
Duration: 7.000284671783447 seconds


Here you iterate through each cost in the array and check if it's less than 25. If so, you add the cost to the total price after tax. This works, but there is a much faster way to do this. Can you refactor this to run in under half a second?

### Refactor Code
**Hint:** Using NumPy makes it very easy to select all the elements in an array that meet a certain condition, and then perform operations on them all at once. You can then find the sum of the resulting values.

In [19]:
start = time.time()

total_price = np.sum(gift_costs[gift_costs < 25]) * 1.08
print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

32765421.240000002
Duration: 0.05987238883972168 seconds
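
To make the selection step concrete, here is a small sketch of boolean masking on a made-up array (the `costs` values are assumptions for illustration). Comparing an array to a number produces an array of booleans, and indexing with that array keeps only the elements where it is `True`.

```python
import numpy as np

costs = np.array([10, 30, 24, 25, 5])   # made-up prices

mask = costs < 25            # array([ True, False,  True, False,  True])
cheap = costs[mask]          # array([10, 24,  5])
total = cheap.sum() * 1.08   # tax applied once to the summed prices

print(mask)
print(cheap)
print(total)
```

Note that the refactored version multiplies by 1.08 once after summing, while the loop multiplies each cost before adding it. The two are mathematically equal, and the different order of floating-point operations is a likely reason the two printed totals above differ in their last digits.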
