# Retrieve recipes by collection ids

This notebook uses helper functions imported from a separate functions notebook (`functions`, loaded via `ipynb.fs.full.functions`).

## Properties:
- input = id numbers for recipe collections<br>
where collections can be accessed at: allrecipes.com/recipes/id_no/<br><br>
- output = folder {timestamp now}
    - folder "collections" with individual collection html files
    - folder "recipes" with individual recipe html files

## Workflow

1. Provide recipe collection (cuisine) ids<br>
    - input = thought  
    - output = a list of id numbers<br><br>
1. Retrieve html pages that contain recipe collections - save as a file<br>
    - input 1 = a list of recipe collection id numbers
    - input 2 = a new filename to use when saving
    - output 1 = a dictionary with key = collection name; value = html page for that collection
    - output 2 = a file that contains html text for all collections<br><br>
1. Extract links to recipes from the recipe collection pages<br>
    - input = a dictionary with key = collection name; value = html page for that collection
    - output = a dictionary with key = collection name; value = a list of links that lead to recipes for that collection<br><br>
1. For each recipe type (collection): Retrieve html pages for the individual recipes - save as separate files for each recipe type<br>
    - input 1 = a dictionary with key = collection name; value = a list of links that lead to recipes for that collection
    - input 2 = a new filename to use when saving
    - output 1 = a dictionary with key = collection name; value = a list of html text for all recipes for that collection
    - output 2 = files that contain html text for recipes, one file per collection, file name reflects the collection<br><br>
1. For each recipe type (collection): Read in the file that contains html recipes<br>
    - input = filename for the file that contains html text for all recipes
    - output = a dictionary with key = collection name; value = a list for all recipes within the collection, with element = html text for a recipe<br><br>
1. For each html recipe of a particular type: extract useful information - then join all information into one pandas dataframe (one row)<br>
(to be looped over all recipes for each collection; over all collections)<br>
    - input = a dictionary with key = collection name; value = a list for all recipes within the collection, with element = html text for a recipe
    - output = a pandas dataframe with columns = useful recipe features; only has a single row = individual recipe<br><br>
1. Make one giant pandas dataframe for all recipes from all collections - just add a column "collection" to individual recipe dataframes and join them together<br>
    - input = lots of pandas dataframes with columns = useful recipe features; each df only has a single row = individual recipe
    - output = a pandas dataframe with columns = useful recipe features including a column "collection"; rows = individual recipes from all collections<br><br>
1. For each recipe type (collection): Make wordclouds for interesting columns in the joint pandas dataframe<br>
    - input = a pandas dataframe with columns = useful recipe features; rows = individual recipes
    - output = wordcloud images, displayed & saved as jpeg files<br><br>
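
Steps 6 and 7 can be sketched roughly as below. This is a minimal sketch, not the code in the functions file: the feature names (`title`, `n_ratings`) and the sample values are made up purely for illustration, and the real one-row dataframes would come from the extraction step.

```python
import pandas as pd

# Hypothetical one-row dataframes, one per recipe, grouped by collection.
# Column names and values are illustrative only.
single_row_dfs_by_collection = {
    "Dutch": [
        pd.DataFrame([{"title": "Recipe A", "n_ratings": 12}]),
        pd.DataFrame([{"title": "Recipe B", "n_ratings": 3}]),
    ],
    "French": [
        pd.DataFrame([{"title": "Recipe C", "n_ratings": 40}]),
    ],
}

# Tag each one-row dataframe with its collection name, then stack them all.
tagged = [
    df.assign(collection=name)
    for name, dfs in single_row_dfs_by_collection.items()
    for df in dfs
]
all_recipes_df = pd.concat(tagged, ignore_index=True)
```

Collecting the small dataframes in a list and calling `pd.concat` once tends to be cheaper than growing a dataframe row by row inside the loop.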

***
Separate jupyter notebook

1. Future: Do some language analysis for text columns<br>
    - input = a pandas dataframe with columns = useful recipe features; rows = individual recipes
    - output = tbd<br><br>
1. Future: Predict number of stars / ratings / reviews<br>
    - input = a pandas dataframe with columns = useful recipe features; rows = individual recipes
    - output = predicted number of stars / ratings / reviews for recipes in the pandas dataframe<br><br>

***
Extra notes<br>
1. Need to add lots of exception handling.
1. If some field is empty, just use "NaN" and move on.
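
One possible shape for the "use NaN and move on" idea is a small wrapper around each field extractor. This is only a sketch under assumptions: `extract_or_nan` and `extract_title` are hypothetical names, not functions from the functions file, and catching every `Exception` is a deliberately blunt choice that may hide real bugs.

```python
import math


def extract_or_nan(extractor, html_text):
    """Run one field extractor; return NaN if it raises or finds nothing."""
    try:
        value = extractor(html_text)
    except Exception:
        # Broad on purpose: any parsing failure becomes a missing value.
        return float("nan")
    if value is None or value == "":
        return float("nan")
    return value


# Hypothetical extractor, used only to exercise the wrapper.
def extract_title(html_text):
    start = html_text.index("<title>") + len("<title>")  # raises ValueError if absent
    end = html_text.index("</title>", start)
    return html_text[start:end].strip()


title_ok = extract_or_nan(extract_title, "<html><title> Stroopwafels </title></html>")
title_missing = extract_or_nan(extract_title, "<html></html>")
```

Because `float("nan")` is what pandas treats as a missing value, fields filled this way can later be found with `isna()` once they are in a dataframe.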

## Import packages / setup

In [86]:
# import public things

# general / random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipynb
import re # for string parsing / editing
import string # for string parsing / editing
from datetime import datetime
import time
import random
from pathlib import Path
import os
import ast

# for html
import requests # for getting html off the web
from bs4 import BeautifulSoup # for parsing html
import json

# for ML
from wordcloud import WordCloud, STOPWORDS
import snowballstemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import NMF

# import functions from my functions file
import ipynb.fs.full.functions as funcs

# update a module if it's been edited
# (this works around a jupyter behaviour where simply re-importing a module doesn't pick up edits)
# https://support.enthought.com/hc/en-us/articles/204469240-Jupyter-IPython-After-editing-a-module-changes-are-not-effective-without-kernel-restart
import importlib
importlib.reload(funcs)

# other useful settings
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5)

## The actual workflow

In [87]:
# create a new folder for the round

time_now = datetime.today().strftime('%Y-%m-%d_%H-%M')
round_folder = f'/home/bkotryna/ML_practice/allrecipes_project/data/{time_now}/'
path_round_folder = Path(round_folder)
path_round_folder.mkdir(parents=True, exist_ok=True)

print(f'Created directory for this round:\n{round_folder}')

Created directory for this round:
/home/bkotryna/ML_practice/allrecipes_project/data/2021-04-30_12-47/


In [88]:
# create a new folder for collection html's

collections_folder = f'/home/bkotryna/ML_practice/allrecipes_project/data/{time_now}/collections/'
path_collections_folder = Path(collections_folder)
path_collections_folder.mkdir(parents=True, exist_ok=True)

print(f'Created directory for collection html\'s:\n{collections_folder}')

Created directory for collection html's:
/home/bkotryna/ML_practice/allrecipes_project/data/2021-04-30_12-47/collections/


In [89]:
# create a new folder for recipe html's

recipes_folder = f'/home/bkotryna/ML_practice/allrecipes_project/data/{time_now}/recipes/'
path_recipes_folder = Path(recipes_folder)
path_recipes_folder.mkdir(parents=True, exist_ok=True)

print(f'Created directory for recipe html\'s:\n{recipes_folder}')

Created directory for recipe html's:
/home/bkotryna/ML_practice/allrecipes_project/data/2021-04-30_12-47/recipes/
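
The three folder-creation cells above repeat the same pattern, so they could be folded into one helper. The sketch below assumes a hypothetical function name, `make_round_folders`, and runs against a temporary directory rather than the real project path.

```python
from datetime import datetime
from pathlib import Path
import tempfile


def make_round_folders(data_root):
    """Create <data_root>/<timestamp>/{collections,recipes}; return the three paths."""
    time_now = datetime.today().strftime('%Y-%m-%d_%H-%M')
    round_folder = Path(data_root) / time_now
    collections_folder = round_folder / 'collections'
    recipes_folder = round_folder / 'recipes'
    # parents=True also creates the round folder itself.
    for folder in (collections_folder, recipes_folder):
        folder.mkdir(parents=True, exist_ok=True)
    return round_folder, collections_folder, recipes_folder


# Demo in a throwaway directory so nothing is written into the project.
demo_root = tempfile.mkdtemp()
round_dir, collections_dir, recipes_dir = make_round_folders(demo_root)
```

Building the subfolder paths from `round_folder` with the `/` operator means the base path is written only once.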


In [90]:
# 1. Provide recipe collection (cuisine) id's

# input = thought
start_id = 720
stop_id = 722

# output
collection_ids = list(range(start_id, stop_id+1))

# just to check
print(f'Cuisine ids are: {collection_ids}')

Cuisine ids are: [720, 721, 722]
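
Each id maps to a collection page following the `allrecipes.com/recipes/id_no/` pattern from the Properties section. The sketch below shows one way a retrieval loop over those ids might look. It is not the implementation of `funcs.retrieve_collections_from_id_list`: the names `collection_url`, `retrieve_pages` and `fake_fetch` are hypothetical, the `https://` scheme is an assumption, and the fetcher is passed in so the example runs offline.

```python
import time


def collection_url(collection_id):
    """Build a collection URL from the pattern in the Properties section."""
    return f'https://allrecipes.com/recipes/{collection_id}/'


def retrieve_pages(ids, fetch, delay_seconds=1.0):
    """Fetch each collection page in turn; skip ids whose fetch raises."""
    pages = []
    for collection_id in ids:
        try:
            pages.append(fetch(collection_url(collection_id)))
        except Exception as error:
            print(f'Could not retrieve collection id {collection_id}: {error}')
        time.sleep(delay_seconds)  # pause between requests
    return pages


# Stand-in fetcher so the sketch runs without network access.
# A real one could be: lambda url: requests.get(url, timeout=10).text
def fake_fetch(url):
    return f'<html>{url}</html>'


demo_pages = retrieve_pages([720, 721, 722], fake_fetch, delay_seconds=0)
```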


In [91]:
# 2. Retrieve html pages that contain recipe collections - save as a file

# input = collection_ids

# change into "collections" folder
os.chdir(path_collections_folder)
print(f'Changed into collections directory:\n{os.getcwd()}\n')

# retrieve & save collection html's
collection_pages_list = funcs.retrieve_collections_from_id_list(collection_ids)

# change back into the project folder
os.chdir('../../..')
print(f'Changed back into project directory:\n{os.getcwd()}\n')

# just to check
print(f'This is how many recipe collection html\'s were retrieved: {len(collection_pages_list)}')

Changed into collections directory:
/home/bkotryna/ML_practice/allrecipes_project/data/2021-04-30_12-47/collections

Start retrieving collections:

Now retrieving collection id 720
Now retrieving collection id 721
Now retrieving collection id 722

Collection retrieval now finished.

Changed back into project directory:/n/home/bkotryna/ML_practice/allrecipes_project
This is how many recipe collection html's were retrieved: 3
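
Changing back with a relative `'../../..'` depends on how deep the data folder sits. A possible alternative, sketched here with a hypothetical helper name `working_directory`, is a context manager that remembers the starting directory and restores it even if retrieval raises.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring the old one on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


# Demo with a throwaway directory standing in for the collections folder.
start_dir = os.getcwd()
with working_directory(tempfile.mkdtemp()):
    inside_dir = os.getcwd()
after_dir = os.getcwd()
```

In the cell above, the retrieval call would then sit inside `with working_directory(path_collections_folder):`, with no need for an explicit change back.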


In [92]:
# 3. Extract recipe ids from links to recipes from the recipe collection pages
# recipe collections aka cuisines aka recipe types

# input
collection_pages_list

# let's extract a list of recipe ids for each cuisine and add it to a dictionary
# key = collection name
# value = a list of recipe ids for that collection

recipe_ids_by_collection_dict = {}
for page in collection_pages_list:
    #page = BeautifulSoup(page)
    # get the collection name
    collection_name = funcs.extract_a_collection_name_from_an_html_page(page)
    # get recipe_ids
    ids_from_collection = funcs.extract_recipe_ids_from_all_links_from_html_file(page)    
    # add the list of recipe ids for this one particular cuisine to the dictionary
    recipe_ids_by_collection_dict[collection_name] = ids_from_collection

# output
# recipe_ids_by_collection_dict
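
For illustration, id extraction of this kind could be done with a regular expression over the link targets. The sketch below is not the body of `funcs.extract_recipe_ids_from_all_links_from_html_file`: the function name `extract_recipe_ids`, the sample html, and the assumption that recipe links contain `/recipe/<digits>/` are all hypothetical.

```python
import re


def extract_recipe_ids(html_text):
    """Return unique numeric recipe ids from href attributes, in order of first appearance."""
    # Assumed link shape: .../recipe/<digits>/...
    found = re.findall(r'href="[^"]*/recipe/(\d+)/', html_text)
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(found))


# Made-up sample page: two recipes (one linked twice) and one collection link.
sample_html = '''
<a href="https://www.allrecipes.com/recipe/11433/example-one/">One</a>
<a href="https://www.allrecipes.com/recipe/68583/example-two/">Two</a>
<a href="https://www.allrecipes.com/recipe/11433/example-one/">One again</a>
<a href="https://www.allrecipes.com/recipes/720/">A collection, not a recipe</a>
'''
sample_ids = extract_recipe_ids(sample_html)
```

The ids come back as strings. Whether they should be converted to integers depends on how the retrieval function builds its URLs.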

In [93]:
# just to check
print("These are the collections we've got:")
for key in recipe_ids_by_collection_dict:
    print(key)

#display(recipe_ids_by_collection_dict)

These are the collections we've got
Dutch
French
German


In [94]:
# 4. For each recipe type (collection): Retrieve html pages for the individual recipes

# input
recipe_ids_by_collection_dict

# change into "recipes" folder
os.chdir(path_recipes_folder)
print(f'Changed into recipes directory:\n{os.getcwd()}\n')

# let's extract a list of recipe html's for each collection and add it to dictionary
# key = collection name
# value = a list of recipe html's for that collection

recipe_pages_by_collection_dict = {}
page_list = []
for collection, id_list in recipe_ids_by_collection_dict.items():
    print(f'\n\n{collection}:')
    
    page_sublist = funcs.retrieve_recipes_from_id_list(id_list)
    recipe_pages_by_collection_dict[collection] = page_sublist
    page_list += page_sublist
    
# just to check
print(f'Number of recipes retrieved is: {len(page_list)}')

# change back into the project folder
os.chdir('../../..')
print(f'\nChanged back into project directory:\n{os.getcwd()}')

# output
# (1) recipe_pages_by_collection_dict
# (2) page_list with all pages for all recipes
# (3) a saved file for each recipe



Dutch:
Start retrieving recipes:

11433 - attempting to retrieve. That makes it 1 pages tried so far.
Success => 1 retrieved total

68583 - attempting to retrieve. That makes it 2 pages tried so far.
Success => 2 retrieved total

220626 - attempting to retrieve. That makes it 3 pages tried so far.
Success => 3 retrieved total

9524 - attempting to retrieve. That makes it 4 pages tried so far.
Success => 4 retrieved total

283744 - attempting to retrieve. That makes it 5 pages tried so far.
Success => 5 retrieved total

15683 - attempting to retrieve. That makes it 6 pages tried so far.
Success => 6 retrieved total

139069 - attempting to retrieve. That makes it 7 pages tried so far.
Success => 7 retrieved total

79797 - attempting to retrieve. That makes it 8 pages tried so far.
Success => 8 retrieved total

258330 - attempting to retrieve. That makes it 9 pages tried so far.
Success => 9 retrieved total

246572 - attempting to retrieve. That makes it 10 pages tried so far.
Success =

221361 - attempting to retrieve. That makes it 20 pages tried so far.
Success => 20 retrieved total

7915 - attempting to retrieve. That makes it 21 pages tried so far.
Success => 21 retrieved total

149686 - attempting to retrieve. That makes it 22 pages tried so far.
Success => 22 retrieved total

87484 - attempting to retrieve. That makes it 23 pages tried so far.
Success => 23 retrieved total

221383 - attempting to retrieve. That makes it 24 pages tried so far.
Success => 24 retrieved total

25194 - attempting to retrieve. That makes it 25 pages tried so far.
Success => 25 retrieved total

17402 - attempting to retrieve. That makes it 26 pages tried so far.
Success => 26 retrieved total

260685 - attempting to retrieve. That makes it 27 pages tried so far.
Success => 27 retrieved total

206120 - attempting to retrieve. That makes it 28 pages tried so far.
Success => 28 retrieved total

214863 - attempting to retrieve. That makes it 29 pages tried so far.
Success => 29 retrieved to