# Homework

https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/02-regression/homework.md

## 2.18 Homework

### Dataset

In this homework, we will use the New York City Airbnb Open Data. You can get it from
[Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv)
or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv)
if you don't want to sign up for Kaggle.

The goal of this homework is to create a regression model for predicting apartment prices (column `'price'`).

### EDA

* Load the data.
* Look at the `price` variable. Does it have a long tail? 
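
One way to check for a long tail is to compare the mean with the median and to look at the sample skewness. The sketch below is only an illustration: the helper name `tail_summary` is hypothetical, and the prices are synthetic stand-ins rather than the real `price` column.

```python
import numpy as np
import pandas as pd

def tail_summary(prices: pd.Series) -> dict:
    """Summarize how right-skewed a price column is (hypothetical helper)."""
    return {
        "mean": prices.mean(),
        "median": prices.median(),
        "p99": prices.quantile(0.99),
        "max": prices.max(),
        "skew": prices.skew(),  # a clearly positive value suggests a long right tail
    }

# Synthetic stand-in for df["price"]. With the real data one would use:
#   df = pd.read_csv("AB_NYC_2019.csv"); tail_summary(df["price"])
rng = np.random.default_rng(0)
fake_prices = pd.Series(np.exp(rng.normal(loc=4.7, scale=0.7, size=10_000)))

summary = tail_summary(fake_prices)
print(summary)
```

A mean well above the median, together with a maximum far beyond the 99th percentile, is the usual sign of a long tail.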

### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'price'`,
* `'minimum_nights'`,
* `'number_of_reviews'`,
* `'reviews_per_month'`,
* `'calculated_host_listings_count'`,
* `'availability_365'`

Select only them.
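
A minimal sketch of the column selection, assuming the data is already in a pandas DataFrame. The helper name `select_features` and the tiny demo frame are illustrative only. Note that in this dataset `id` and `host_id` are numeric too, so selecting "all numeric columns" is not the same as selecting the list above.

```python
import pandas as pd

# Column names copied from the list above.
COLUMNS = [
    "latitude", "longitude", "price", "minimum_nights",
    "number_of_reviews", "reviews_per_month",
    "calculated_host_listings_count", "availability_365",
]

def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df restricted to the homework columns (hypothetical helper)."""
    return df[COLUMNS].copy()

# Tiny illustrative frame with two extra columns that should be dropped.
demo = pd.DataFrame({**{c: [1.0, 2.0] for c in COLUMNS}, "id": [10, 11], "host_id": [7, 8]})
selected = select_features(demo)
print(selected.columns.tolist())
```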

### Question 1

Find a feature with missing values. How many missing values does it have?


### Question 2

What's the median (50th percentile) of the variable `minimum_nights`?


### Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.
* Make sure that the target value (`price`) is not in your dataframe.
* Apply the log transformation to the price variable using the `np.log1p()` function.
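
The log transformation and its inverse can be sketched as follows. The price values are made up for illustration; `np.expm1` is the NumPy inverse of `np.log1p` and is what one would use to turn predictions back into prices.

```python
import numpy as np

prices = np.array([0.0, 80.0, 149.0, 10_000.0])  # illustrative values only

y = np.log1p(prices)    # log(1 + price); well defined even when price == 0
restored = np.expm1(y)  # inverse transform: exp(y) - 1

print(y)
print(restored)
```

Using `log1p` instead of a plain logarithm avoids `-inf` for listings with a price of 0.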


### Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?
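
The "training set only" rule can be sketched like this: the mean is computed once on the training part and then reused unchanged for the validation part. The two small frames are illustrative stand-ins for the real splits.

```python
import numpy as np
import pandas as pd

# Illustrative frames; in the homework these come from the 60/20/20 split.
X_train = pd.DataFrame({"reviews_per_month": [0.5, np.nan, 1.5, 2.0]})
X_val = pd.DataFrame({"reviews_per_month": [np.nan, 3.0]})

# The statistic is computed on the training part only
# (pandas skips NaN values by default when computing the mean).
train_mean = X_train["reviews_per_month"].mean()

# The same value is then used to fill both parts.
X_train_filled = X_train.fillna(train_mean)
X_val_filled = X_val.fillna(train_mean)

print(train_mean)
print(X_val_filled)
```

Computing the mean on the full dataset instead would leak information from the validation rows into training.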


### Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.
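
For reference, regularized linear regression of this kind is commonly written in closed form as shown below. This is a sketch in standard notation, assuming $X$ already contains a leading column of ones for the bias term and $I$ is the identity matrix of matching size:

$$
w = (X^\top X + r I)^{-1} X^\top y
$$

With $r = 0$ this reduces to ordinary least squares. A larger $r$ adds weight to the diagonal of $X^\top X$, which shrinks the weights and makes the matrix easier to invert.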


### Question 5 

* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)


> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different. 
> If standard deviation of scores is low, then our model is *stable*.
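
A small illustration of the note above, using made-up RMSE values. Keep in mind that `np.std` computes the population standard deviation by default (`ddof=0`).

```python
import numpy as np

stable_scores = [0.64, 0.65, 0.64, 0.65]    # made-up RMSE values, close together
unstable_scores = [0.40, 0.90, 0.55, 0.75]  # made-up RMSE values, spread out

std_stable = round(np.std(stable_scores), 3)
std_unstable = round(np.std(unstable_scores), 3)

print(std_stable, std_unstable)
```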


### Question 6

* Split the dataset as before, using seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?
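
Combining the train and validation parts can be sketched as follows, assuming features are pandas DataFrames and targets are NumPy arrays. The values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative splits; in the homework these come from the seed-9 split.
X_train = pd.DataFrame({"minimum_nights": [1, 3, 2]})
X_val = pd.DataFrame({"minimum_nights": [10, 1]})
y_train = np.log1p(np.array([149.0, 225.0, 150.0]))
y_val = np.log1p(np.array([89.0, 80.0]))

# Stack the features row-wise and reset the index;
# concatenate the targets in the same order so rows stay aligned.
X_full_train = pd.concat([X_train, X_val]).reset_index(drop=True)
y_full_train = np.concatenate([y_train, y_val])

print(X_full_train.shape, y_full_train.shape)
```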


## Submit the results

Submit your results here: https://forms.gle/2N9GkTr1AgNeZ8hD7.

If your answer doesn't match options exactly, select the closest one.

## Deadline

The deadline for submitting is 20 September 2021, 17:00 CET. After that, the form will be closed.




# Libraries

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


In [4]:
s = '../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv'
df = pd.read_csv(s)

In [5]:
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


# Question 1

- Find a feature with missing values. How many missing values does it have?

In [22]:
def get_numeric_features(df):
    # Returns the numeric columns of the DataFrame.
    # To get the column names as a list: get_numeric_features(df).columns.tolist()
    return df.select_dtypes(include=np.number)
#enddef

In [90]:
df_select = get_numeric_features(df).isnull()
# Show the count and percentage of NaN values for each column that contains any
MissSum = (
    df_select
    .sum() # number of NaN values per column
    [lambda x : x>0] # keep only columns with at least one NaN
    .to_frame(name='Sum') # convert to DataFrame
    .assign(Percent=lambda x: x["Sum"] * 100 / len(df_select) ) # share of rows, in percent
    .style.format({
        "Sum": "{:d}",
        "Percent": "{:.1f}"
    })
)
print('Question 1: Number of missing values: ')
display(MissSum)
print('Numeric features with  missing values:\n', 
      list(get_numeric_features(df).isnull().sum()[lambda x : x>0].index))

Question 1: Number of missing values: 


| | Sum | Percent |
|---|---|---|
| reviews_per_month | 10052 | 20.6 |


Numeric features with  missing values:
 ['reviews_per_month']


# Question 2

What's the median (50th percentile) of the variable `minimum_nights`?


In [7]:
df['minimum_nights'].median()

3.0

# Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.
* Make sure that the target value (`price`) is not in your dataframe.
* Apply the log transformation to the price variable using the `np.log1p()` function.

In [85]:
def split_data(df, test=0.2, val=0.0, y=None, seed=0, y_op=None):
    # Shuffles the rows of df (reproducibly, via seed) and splits them into
    # train/val/test parts. If y names the target column, it is removed from
    # the features and returned separately, optionally transformed by y_op.
    np.random.seed(seed)

    n = len(df)

    n_test = int(test * n) if 0.0 < test <= 1.0 else 0
    n_val = int(val * n) if 0.0 < val <= 1.0 else 0
    n_train = max(n - (n_val + n_test), 0)

    idx = np.arange(n)
    np.random.shuffle(idx) # shuffles the row positions

    # X: drop the target column if provided, then reorder the rows
    X_all = df.drop([y], axis=1) if y is not None else df
    X_shuffled = X_all.iloc[idx]

    X_test = X_shuffled.iloc[n_train+n_val:].copy() if n_test > 0 else None
    X_val = X_shuffled.iloc[n_train:n_train+n_val].copy() if n_val > 0 else None
    X_train = X_shuffled.iloc[:n_train].copy() if n_train > 0 else None

    if y is None:
        return X_train, X_val, X_test
    #endif

    # y: same row order as X
    y_shuffled = df[y].iloc[idx]
    y_test = y_shuffled.iloc[n_train+n_val:].copy().values if n_test > 0 else None
    y_val = y_shuffled.iloc[n_train:n_train+n_val].copy().values if n_val > 0 else None
    y_train = y_shuffled.iloc[:n_train].copy().values if n_train > 0 else None

    if y_op is not None:
        y_test = y_op(y_test) if n_test > 0 else None
        y_val = y_op(y_val) if n_val > 0 else None
        y_train = y_op(y_train) if n_train > 0 else None
    #endif

    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
#enddef

In [91]:
xy_train, xy_val, xy_test = split_data(
    get_numeric_features(df)
    ,test=0.2
    ,val=0.2
    ,y='price'
    ,seed=42 ,y_op=np.log1p
)

# Question 3

* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using `round(score, 2)`
* Which option gives better RMSE?


In [82]:
def get_rmse(y_pred, y_true):
    mse = ((y_pred - y_true) ** 2).mean()
    return np.sqrt(mse)
#enddef

def train_linear_regression_reg(X_df, y, r=0.0):
    # linear regression model with regularization 
    
    X = X_df.values    
    
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    reg = r * np.eye(XTX.shape[0])
    XTX = XTX + reg

    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    
    def model(x):
        return w[0] + np.dot(x.values, w[1:])
    #enddef
    
    return model
#enddef

def rmse_of_train(xy_train,xy_test,fill=0,reg=0):
    X_train ,y_train = xy_train
    X_test ,y_test = xy_test    
    model  = train_linear_regression_reg(X_train.fillna(fill),y_train,r=reg)
    y_pred = model(X_test.fillna(fill))
    return get_rmse(y_pred, y_test)    
#enddef    

In [92]:
X_train, y_train = xy_train
X_val ,y_val = xy_val
rpm_mean = X_train['reviews_per_month'].mean()
for fill in [0,rpm_mean]:
    rmse = rmse_of_train(xy_train,xy_val,fill=fill)
    print(f'fill={fill:3.2g}, rmse {round(rmse,2)} ({rmse:.4f})')
#endfor
    
# Question 3: which fill value gives the better RMSE?
# -> After rounding to 2 decimals both options give the same RMSE (0.64)

fill=  0, rmse 0.64 (0.6432)
fill=1.4, rmse 0.64 (0.6428)


# Question 4

* Now let's train a regularized linear regression.
* For this question, fill the NAs with 0. 
* Try different values of `r` from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
* Use RMSE to evaluate the model on the validation dataset.
* Round the RMSE scores to 2 decimal digits.
* Which `r` gives the best RMSE?

If there are multiple options, select the smallest `r`.

In [93]:
fill = 0
for r in [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    rmse = rmse_of_train(xy_train,xy_val,fill=fill,reg=r)
    print(f'r={r:6.1g} -> rmse = {round(rmse,2)} ({rmse:.5f})')
#endfor
# Answer: after rounding, r = 0, 1e-06, 0.0001 and 0.001 all give 0.64; the smallest of them is r = 0

r=     0 -> rmse = 0.64 (0.64318)
r= 1e-06 -> rmse = 0.64 (0.64318)
r=0.0001 -> rmse = 0.64 (0.64323)
r= 0.001 -> rmse = 0.64 (0.64396)
r=  0.01 -> rmse = 0.66 (0.65599)
r=   0.1 -> rmse = 0.68 (0.67729)
r=     1 -> rmse = 0.68 (0.68218)
r=     5 -> rmse = 0.68 (0.68266)
r= 1e+01 -> rmse = 0.68 (0.68271)
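
Forming the explicit inverse works for a problem of this size, but solving the linear system directly is generally cheaper and numerically better behaved. The sketch below shows that variant on synthetic data. The function name `fit_ridge` is hypothetical and this is not the course code. As in the regularized model used above, `r` is added to the whole diagonal, including the bias term.

```python
import numpy as np

def fit_ridge(X: np.ndarray, y: np.ndarray, r: float = 0.0) -> np.ndarray:
    """Return weights [bias, w1, ..., wk] for regularized least squares."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])  # prepend the bias column
    XTX = X1.T @ X1 + r * np.eye(X1.shape[1])       # r is added to the bias term too
    return np.linalg.solve(XTX, X1.T @ y)           # solve the system instead of inverting

# Synthetic, noise-free data with known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

w = fit_ridge(X, y, r=0.0)
w_reg = fit_ridge(X, y, r=100.0)
print(w)
print(w_reg)
```

With `r=0.0` the known weights are recovered. A large `r` pulls all weights towards zero.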


# Question 5 
* We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
* Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
* For each seed, do the train/validation/test split with 60%/20%/20% distribution.
* Fill the missing values with 0 and train a model without regularization.
* For each seed, evaluate the model on the validation dataset and collect the RMSE scores. 
* What's the standard deviation of all the scores? To compute the standard deviation, use `np.std`.
* Round the result to 3 decimal digits (`round(std, 3)`)


> Note: Standard deviation shows how different the values are.
> If it's low, then all values are approximately the same.
> If it's high, the values are different. 
> If standard deviation of scores is low, then our model is *stable*.

In [94]:
fill = 0
reg=0
rmse_list = []
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    
    xy_train, xy_val, xy_test = split_data(
        get_numeric_features(df)
        ,test=0.2
        ,val=0.2
        ,y='price'
        ,seed=seed 
        ,y_op=np.log1p
    )
    
    rmse = rmse_of_train(xy_train,xy_val,fill=fill,reg=reg)
    
    rmse_list.append(rmse)
#     print(f'seed={seed} -> rmse = {round(rmse,2)} ({rmse:.5f})')    
#endfor

print(f'Question 5: STD of RMSE scores for different seeds: {round(np.std(np.array(rmse_list)),3)}')

Question 5: STD of RMSE scores for different seeds: 0.008


# Question 6
* Split the dataset as before, using seed 9.
* Combine train and validation datasets.
* Fill the missing values with 0 and train a model with `r=0.001`. 
* What's the RMSE on the test dataset?


In [95]:
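seed = 9
fill = 0
reg = 0.001
# val=0.0 leaves 80% of the shuffled rows in the training part. For a fixed seed
# these are exactly the train + validation rows of the 60/20/20 split, and the
# test part (the last 20% of the shuffled rows) is unchanged.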
xy_train, xy_val, xy_test = split_data(
    get_numeric_features(df)
    ,test=0.2
    ,val=0.0
    ,y='price'
    ,seed=seed 
    ,y_op=np.log1p
)
rmse = rmse_of_train(xy_train,xy_test,fill=fill,reg=reg)
print(f'Question 6: RMSE on test = {round(rmse,2)} ({rmse:.5f})')

Question 6: RMSE on test = 0.65 (0.64510)
