## Survey Data Collection Notebook

## Objectives
* Fetch data from Kaggle and save it as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs
* Kaggle JSON file - the authentication token

## Outputs
* Generate Dataset: outputs/datasets/collection/BreakfastSurvey.csv

---

# Install Python packeges in the notebook

## Change working directory
We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Guest-Survey-Analysis-to-Improve-Hotel-Breakfast/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/Guest-Survey-Analysis-to-Improve-Hotel-Breakfast'

## Fetch data from Kaggle

Recognize token in the session

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Define the Kaggle dataset, and destination folder and download it.

In [6]:
KaggleDatasetPath = "zoltnnyrdi/breakfast-survey"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

breakfast-survey.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  # && rm kaggle.json

Archive:  inputs/datasets/raw/breakfast-survey.zip
  inflating: inputs/datasets/raw/BreakfastSurvey.csv  


---

## Load and Inspect Kaggle data

In [8]:
import pandas as pd
df = pd.read_csv(f"inputs/datasets/raw/BreakfastSurvey.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,taste,breakfast,varety,service,price,staff,appearance
0,0,1,No,3,2,1,1,4
1,1,1,No,4,3,2,2,4
2,2,2,No,4,2,3,3,5
3,3,3,No,4,3,4,4,5
4,4,1,"Yes, next time not",4,3,2,2,4


Delete old indexes

In [15]:
df.drop(labels="Unnamed: 0", axis=1, inplace=True)

DataFrame Summary

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53820 entries, 0 to 53819
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   taste       53820 non-null  int64 
 1   breakfast   53820 non-null  object
 2   varety      53820 non-null  int64 
 3   service     53820 non-null  int64 
 4   price       53820 non-null  int64 
 5   staff       53820 non-null  int64 
 6   appearance  53820 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 2.9+ MB


Check unique values in `breakfast` column

In [19]:
df["breakfast"].unique()

array(['No', 'Yes, next time not', 'Yes, again'], dtype=object)

Converting `breakfast` to nummerical

In [21]:
df["breakfast"] = df["breakfast"].replace({"Yes, again":2,
                                           "Yes, next time not":1,
                                           "No":0
                                           })

Check `breakfast` data type

In [22]:
df["breakfast"].dtype

dtype('int64')

## Save output file

In [23]:
import os
try:
  os.makedirs(name='outputs/datasets/collection')
except Exception as e:
  print(e)

df.to_csv(f"outputs/datasets/collection/BreakfastSurvey.csv",index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
