# Assignment 2 - 19/11/2024

**Objective:** Learn and apply web scraping techniques using Python to extract data from web pages, focusing on understanding and utilizing various tools and methods. 
**Scope:** This assignment will involve scraping a simple, legally permissible website (e.g., a webpage displaying weather data, book listings, or movie reviews). Ensure the chosen website allows scraping by checking its robots.txt file and terms of service.

**Tasks:** 
1. Choose a Website:
- Identify a website to scrape. Verify that the website permits scraping. 
- Briefly describe the website and the data you intend to scrape. 
2. Set Up Your Environment:
- Install necessary Python libraries (requests, Scrapy, etc.). 
- Set up a Python script or Jupyter notebook for your scraping code.
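
The permission check in task 1 can be partly automated. The following is a minimal sketch using the standard library's `urllib.robotparser`. The `can_scrape` helper and the sample robots.txt rules are hypothetical and only illustrate the idea; they are not the rules of any real site, and a site's terms of service still have to be read by hand.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, made up for illustration only
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /admin/
"""

print(can_scrape(SAMPLE_ROBOTS, "https://example.com/tarif/"))
print(can_scrape(SAMPLE_ROBOTS, "https://example.com/admin/panel"))
```

For a live site, calling `set_url(...)` with the address of its robots.txt and then `read()` on the `RobotFileParser` downloads the real rules instead of parsing a string.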

In [3]:
pip install scrapy

Note: you may need to restart the kernel to use updated packages.


In [298]:
# I import Scrapy's Selector and the requests library.
from scrapy import Selector
import requests

# I create empty lists to hold the scraped values, one entry per recipe.
titles = []
dates = []
rates = []
serves = []
prep_times = []
cook_times = []
recipes = []

# I define the URL of the recipe listing page that I scrape.
url = 'https://yemek.com/tarif/'

# I download the page; .content holds the raw HTML source as bytes.
html = requests.get(url).content

# I create the selector object sel from the HTML source.
sel = Selector(text=html)

# I check the number of elements in the HTML document.
print("Number of elements:", len(sel.xpath('//*')))

Number of elements: 700
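
To see on a small scale what the element count and an XPath query do, the sketch below runs similar queries on a tiny hand-written snippet. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's `Selector`, an assumption made so the example runs without extra packages. ElementTree supports only a subset of XPath and needs well-formed markup, and the `SAMPLE` snippet is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet that mimics a recipe listing (illustration only)
SAMPLE = """
<html><body>
  <div id="list">
    <div><h4><a href="/tarif/sinkonta/">Sinkonta Tarifi</a></h4></div>
    <div><h4><a href="/tarif/perde-pilavi/">Perde Pilavı Tarifi</a></h4></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Count every element in the document, including the root
# (comparable in spirit to len(sel.xpath('//*')))
element_count = len(list(root.iter()))

# A short relative path: every <a> directly inside an <h4>, at any depth
links = [a.get("href") for a in root.findall(".//h4/a")]

print("Number of elements:", element_count)
print(links)
```

A short relative path such as `.//h4/a` depends less on the exact page layout than a long absolute path with positional indexes, so it tends to survive small changes to the page.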


In [300]:
# On the listing page, I compared the XPaths of several recipe links and generalized them
# into one common XPath; then I check that it returns the links.
xpath_for_recipe_links = '//*[@id="__next"]/div/div[3]/main/section[2]/div[4]/div/div[1]/div/div/div/h4/a/@href'
recipe_links = sel.xpath(xpath_for_recipe_links).extract()
print(recipe_links)


['/tarif/mini-bal-kabagi-dolmasi/', '/tarif/palamut-eksilisi/', '/tarif/sinkonta/', '/tarif/ayva-dolmasi-2/', '/tarif/firinda-kereviz/', '/tarif/etsiz-kuru-fasulye/', '/tarif/perde-pilavi/', '/tarif/tarhana-corbasi/', '/tarif/firinda-palamut/']
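
The extracted links are relative paths, so each one has to be combined with the site's base URL before it can be requested. One way to do this is `urllib.parse.urljoin` from the standard library, sketched here with two of the links from the output above; the variable names are only illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://yemek.com"

# Two of the relative links returned by the XPath query
relative_links = ["/tarif/sinkonta/", "/tarif/perde-pilavi/"]

# urljoin resolves each relative path against the base URL
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]

for link in absolute_links:
    print(link)
```

Compared with plain string concatenation, `urljoin` also copes with links that are already absolute or that lack a leading slash.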


3. Data Extraction:
- Use the requests library to send an HTTP request and retrieve the content of the webpage.
- Parse the webpage content using Scrapy to extract relevant data (e.g., titles, descriptions, ratings).
4. Data Cleaning:
- Clean and organize the scraped data into a readable and usable format.
- Handle any missing or inconsistent data entries.

In [303]:
import numpy as np

# I loop over the recipe links and visit each recipe page to get the info that I want.
# Each page URL is the base URL plus the relative link of the recipe.
for recipe_link in recipe_links:
    url1 = 'https://yemek.com' + recipe_link
    html1 = requests.get(url1).content
    sel1 = Selector(text=html1)

    # The title XPath sometimes returns an extra blank element first, so I take index 1
    # when there are two matches, index 0 when there is one, and NaN in all other cases.
    # Exactly one title is appended per recipe so that all lists keep the same length.
    xpath_for_title = '//*[@id="__next"]/div/div[3]/main/section[2]/div/div[1]/div[1]/div[1]/div/span/text()'
    title = sel1.xpath(xpath_for_title).extract()
    if len(title) == 2:
        titles.append(title[1])
    elif len(title) == 1:
        titles.append(title[0])
    else:
        titles.append(np.nan)

    # `date` has to be extracted from sel1 before this line; its XPath is not shown in this cell.
    dates.append(date)
    # Then I get the rest of the recipe data (rating, servings, preparation time,
    # cooking time, and recipe notes) by defining an XPath for each field.
    xpath_for_rate = '//*[@id="__next"]/div/div[3]/main/section[2]/div/div[2]/div[1]/div/div[2]/span[2]/text()'
    rate = sel1.xpath(xpath_for_rate).extract()
    rates.append(rate)
    xpath_for_serve = '//*[@id="__next"]/div/div[3]/main/section[2]/div/div[1]/section/div[1]/div[1]/div/span/text()'
    serve = sel1.xpath(xpath_for_serve).extract()
    serves.append(serve)
    xpath_for_prep_time = '//*[@id="__next"]/div/div[3]/main/section[2]/div/div[1]/section/div[1]/div[2]/div/span/text()'
    prep_time = sel1.xpath(xpath_for_prep_time).extract()
    prep_times.append(prep_time)
    xpath_for_cook_time = '//*[@id="__next"]/div/div[3]/main/section[2]/div/div[1]/section/div[1]/div[3]/div/span/text()'
    cook_time = sel1.xpath(xpath_for_cook_time).extract()
    cook_times.append(cook_time)
    xpath_for_recipe = '//*[@id="__next"]/div/div[3]/main/section[2]/div/div[1]/section/div[4]/div[2]/text()'
    recipe = sel1.xpath(xpath_for_recipe).extract()
    recipes.append(recipe)

print(titles)
print(dates)
print(rates)
print(serves)
print(prep_times)
print(cook_times)
print(recipes)

['Mini Bal Kabağı Dolması Tarifi', 'Palamut Ekşilisi Tarifi', 'Sinkonta Tarifi', 'Ayva Dolması Tarifi', 'Fırında Kereviz Tarifi', 'Etsiz Kuru Fasulye Tarifi', 'Perde Pilavı Tarifi', 'Tarhana Çorbası Tarifi', 'Fırında Palamut Tarifi']
[['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022'], ['19 Ekim 2022']]
[['5'], ['5'], ['4.5'], ['4.5'], ['4.5'], ['5'], ['5'], ['4.5'], ['4.5']]
[['4 kişilik'], ['4 kişilik'], ['4 kişilik'], ['3 adet'], ['4 kişilik'], ['6 kişilik'], ['6 kişilik'], ['6 kişilik'], ['2 adet']]
[['30 dakika'], ['10 dakika'], ['15 dakika'], ['20 dakika'], ['15 dakika'], ['30 dakika'], ['30 dakika'], ['5 dakika'], ['5 dakika']]
[['50 dakika'], ['30 dakika'], ['25 dakika'], ['30 dakika'], ['45 dakika'], ['90 dakika'], ['1 saat'], ['20 dakika'], ['30 dakika']]
[[], [], ['Acı yağ yakılmış süzme yoğurtla birlikte servis edebilirsiniz.  '], [], ['Kerevizleri doğradıktan sonra hemen tarife başl
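
The raw lists above still wrap every value in a one-element list and keep durations as Turkish text ("dakika" means minutes, "saat" means hours). A possible cleaning step is sketched below, under the assumption that each field holds at most one value per recipe. The helper names `first_or_nan` and `to_minutes` are hypothetical, and the empty lists in the sample data stand in for missing entries.

```python
import math
import re


def first_or_nan(values):
    """Return the first string of a list, stripped, or NaN when the list is empty."""
    return values[0].strip() if values else math.nan


def to_minutes(text):
    """Convert strings such as '30 dakika' or '1 saat' to a number of minutes."""
    if not isinstance(text, str):
        return math.nan
    parts = re.findall(r"(\d+)\s*(saat|dakika)", text)
    if not parts:
        return math.nan
    # Hours count as 60 minutes each, minutes count as they are
    return sum(int(amount) * (60 if unit == "saat" else 1) for amount, unit in parts)


# Sample values in the same shape as the scraped lists; [] marks a missing entry
raw_cook_times = [['50 dakika'], ['1 saat'], []]
raw_rates = [['5'], ['4.5'], []]

cook_minutes = [to_minutes(first_or_nan(v)) for v in raw_cook_times]
ratings = [float(first_or_nan(v)) for v in raw_rates]

print(cook_minutes)
print(ratings)
```

Storing the numbers rather than the text makes it possible to sort or average the recipes later, and NaN keeps missing entries visible instead of silently dropping rows.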

5. Data Storage: 
- Store the extracted data in a structured format like a CSV file or a Pandas DataFrame.

In [320]:
import pandas as pd

# I make a pandas DataFrame with one column per list.
recipe_df = pd.DataFrame({
    "Recipe Name": titles,
    "Recipe Date": dates,
    "Serves": serves,
    "Preparation Time": prep_times,
    "Cooking Time": cook_times,
    "Recipe": recipes,
})

# I represent any missing values uniformly as NaN.
recipe_df = recipe_df.fillna(np.nan)

# Lastly, I write the DataFrame to an .xlsx file (to_excel needs an Excel writer such as openpyxl).
recipe_df.to_excel("recipes_output.xlsx")
print("You have done it")

You have done it
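
The task also mentions CSV as a storage format. As an alternative to the Excel export, the sketch below writes two of the scraped rows with the standard library's `csv` module instead of pandas. It writes to an in-memory buffer so that it runs anywhere; the choice of columns is only illustrative.

```python
import csv
import io

# Two rows taken from the scraped output above
rows = [
    {"Recipe Name": "Sinkonta Tarifi", "Serves": "4 kişilik", "Cooking Time": "25 dakika"},
    {"Recipe Name": "Perde Pilavı Tarifi", "Serves": "6 kişilik", "Cooking Time": "1 saat"},
]

# An in-memory text buffer stands in for a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Recipe Name", "Serves", "Cooking Time"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer can be replaced with `open(path, "w", newline="", encoding="utf-8")`, where `path` is whatever file name is wanted; the UTF-8 encoding keeps the Turkish characters intact.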
